github.com/mono/mono.git
author    Miguel de Icaza <miguel@gnome.org>    2003-04-05 23:21:32 +0400
committer Miguel de Icaza <miguel@gnome.org>    2003-04-05 23:21:32 +0400
commit    5a710fa941e8d9c64a6228fe173cf50976eed05e (patch)
tree      300e045388fa0e8eabad7f6ccee0325d5a0d0149 /docs
parent    64655bb5ff8963dc61d8594bd2bfe584c6efbeb3 (diff)
New compilation engine for Mono
svn path=/trunk/mono/; revision=13205
Diffstat (limited to 'docs')
-rw-r--r--  docs/aot-compiler.txt   |  44
-rw-r--r--  docs/local-regalloc.txt | 208
-rw-r--r--  docs/mini-doc.txt       | 628
-rw-r--r--  docs/opcode-decomp.txt  | 113
4 files changed, 993 insertions, 0 deletions
diff --git a/docs/aot-compiler.txt b/docs/aot-compiler.txt
new file mode 100644
index 00000000000..ab1af90d965
--- /dev/null
+++ b/docs/aot-compiler.txt
@@ -0,0 +1,44 @@
+Mono Ahead Of Time Compiler
+===========================
+
+The new Mono JIT has sophisticated optimization features. It uses SSA and has a
+pluggable architecture for further optimizations. This makes it possible and
+efficient to use the JIT for AOT compilation as well.
+
+
+* file format: We use the native object format of the platform. That way it is
+ possible to reuse existing tools like objdump and the dynamic loader. All we
+ need is a working assembler, i.e. we write out a text file which is then
+ passed to gas (the GNU assembler) to generate the object file.
+
+* file names: we simply add ".so" to the generated file. For example:
+ basic.exe -> basic.exe.so
+ corlib.dll -> corlib.dll.so
+
+* starting the AOT compiler: mini --aot assembly_name
+
+The following things are saved in the object file:
+
+* version info:
+
+* native code: this is labeled with method_XXXXXXXX: where XXXXXXXX is the
+ hexadecimal token number of the method.
+
+* additional information needed by the runtime: For example, we need to store
+ the code length and the exception tables. We also need a way to patch
+ constants that are only available at runtime (for example vtable and class
+ addresses). This is stored in a binary blob labeled method_info_XXXXXXXX:
+
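+As a purely illustrative sketch (the helper below is hypothetical and not part
+of the Mono sources), the runtime could resolve the code and the info blob for
+a method by building these label names and looking them up in the loaded
+shared object:
+
+    #include <dlfcn.h>
+    #include <stdio.h>
+
+    /* Hypothetical helper: given the dlopen()ed AOT module and a method
+     * token, look up the method_XXXXXXXX code symbol and its companion
+     * method_info_XXXXXXXX blob. */
+    static void *
+    aot_lookup_method (void *module, unsigned int token, void **info)
+    {
+            char name [64];
+            void *code;
+
+            snprintf (name, sizeof (name), "method_%08X", token);
+            code = dlsym (module, name);
+            if (!code)
+                    return NULL;
+
+            snprintf (name, sizeof (name), "method_info_%08X", token);
+            *info = dlsym (module, name);
+            return code;
+    }
+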
+PROBLEMS:
+
+- all precompiled methods must be domain independent, or we must add patch info
+ to patch in the target domain.
+
+- the main problem is how to patch runtime related addresses, for example:
+
+ - current application domain
+ - string objects loaded with LDSTR
+ - address of MonoClass data
+ - static field offsets
+ - method addresses
+ - virtual function and interface slots
diff --git a/docs/local-regalloc.txt b/docs/local-regalloc.txt
new file mode 100644
index 00000000000..a6e523557fe
--- /dev/null
+++ b/docs/local-regalloc.txt
@@ -0,0 +1,208 @@
+
+* Proposal for the local register allocator
+
+ The local register allocator deals with allocating registers
+ for temporaries inside a single basic block, while the global
+ register allocator is concerned with method-wide allocation of
+ variables.
+ The global register allocator uses callee-saved registers for this
+ purpose so that there is no need to save and restore these registers
+ at call sites.
+
+ There are a number of issues the local allocator needs to deal with:
+ *) some instructions expect operands in specific registers (for example
+ the shl instruction on x86, or the call instruction with thiscall
+ convention, or the equivalent call instructions on other architectures,
+ such as the need to put output registers in %oX on sparc)
+ *) some instructions deliver results only in specific registers (for example
+ the div instruction on x86, or the call instructions on almost all
+ the architectures).
+ *) it needs to know what registers may be clobbered by an instruction
+ (such as in a method call)
+ *) it should avoid excessive reloads or stores to improve performance
+
+ While which specific instructions have limitations is architecture-dependent,
+ the problem should be solved in an arch-independent way to reduce code duplication.
+ The register allocator will be 'driven' by the arch-dependent code, but its
+ implementation should be arch-independent.
+
+ To improve on the current local register allocator, we need to keep more
+ state than the current setup, which only tracks busy/free information.
+
+ Possible state information is:
+
+ free: the register is free to use and it doesn't contain useful info
+ freeable: the register contains data loaded from a local (there is
+ also info about _which_ local it contains) as a result from previous
+ instructions (like, there was a store from the register to the local)
+ moveable: it contains live data that is needed in a following instruction, but
+ the contents may be moved to a different register
+ busy: the register contains live data and it is placed there because
+ the following instructions need it exactly in that register
+ allocated: the register is used by the global allocator
+
+ The local register allocator will have the following interfaces:
+
+ int get_register ();
+ Searches for a register in the free state. If it doesn't find it,
+ searches for a freeable register. Sets the status to moveable.
+ Looking for a 'free' register before a freeable one should allow for
+ removing a few redundant loads (though I'm still unsure if such
+ things should be delegated entirely to the peephole pass).
+
+ int get_register_force (int reg);
+ Returns 'reg' if it is free or freeable. If it is moveable, it moves it
+ to another free or freeable register.
+ Sets the status of 'reg' to busy.
+
+ void set_register_freeable (int reg);
+ Sets the status of 'reg' to freeable.
+
+ void set_register_free (int reg);
+ Sets the status of 'reg' to free.
+
+ void will_clobber (int reg);
+ Spills the register to the stack. Sets the status to freeable.
+ After the clobbering has occurred, set the status to free.
+
+ void register_unspill (int reg);
+ Un-spills register reg and sets the status to moveable.
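+
+ Putting the states and the calls together, the proposed interface could be
+ sketched roughly as below (a hypothetical header for illustration, not
+ actual Mono code):
+
+      /* Hypothetical sketch of the proposed local allocator interface. */
+      typedef enum {
+              REG_FREE,       /* no useful contents */
+              REG_FREEABLE,   /* mirrors a local; can be reused without a spill */
+              REG_MOVEABLE,   /* live data that may be moved to another register */
+              REG_BUSY,       /* live data required exactly in this register */
+              REG_ALLOCATED   /* owned by the global register allocator */
+      } RegState;
+
+      int  get_register          (void);
+      int  get_register_force    (int reg);
+      void set_register_freeable (int reg);
+      void set_register_free     (int reg);
+      void will_clobber          (int reg);
+      void register_unspill      (int reg);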
+
+ FIXME: how is the 'local' information represented? Maybe a MonoInst* pointer.
+
+ Note: the register allocator will insert instructions in the basic block
+ during its operation.
+
+* Examples
+
+ Given the tree (on x86 the right argument to shl needs to be in ecx):
+
+ store (local1, shl (local1, call (some_arg)))
+
+ At the start of the basic block, the registers are set to the free state.
+ The sequence of instructions may be:
+ instruction register status -> [%eax %ecx %edx]
+ start free free free
+ eax = load local1 mov free free
+ /* call clobbers eax, ecx, edx */
+ spill eax free free free
+ call mov free free
+ /* now eax contains the right operand of the shl */
+ mov %eax -> %ecx free busy free
+ un-spill mov busy free
+ shl %cl, %eax mov free free
+
+ The resulting x86 code is:
+ mov $fffc(%ebp), %eax
+ mov %eax, $fff0(%ebp)
+ push some_arg
+ call func
+ mov %eax, %ecx
+ mov $fff0(%ebp), %eax
+ shl %cl, %eax
+
+ Note that since shl could operate directly on memory, we could have:
+
+ push some_arg
+ call func
+ mov %eax, %ecx
+ shl %cl, $fffc(%ebp)
+
+ The earlier version that loads the operand into a register is only meant to
+ make the example more complex and to show that the algorithm can handle it.
+
+ Let's take another example with the thiscall calling convention (the first argument
+ is passed in %ecx).
+ In this case, will_clobber() will be called only on %eax and %edx, while %ecx
+ will be allocated with get_register_force ().
+ Note: when a register is allocated with get_register_force(), it should be set
+ to a different state as soon as possible.
+
+ store (local1, shl (local1, this-call (local1)))
+
+ instruction register status -> [%eax %ecx %edx]
+ start free free free
+ eax = load local1 mov free free
+ /* force load in %ecx */
+ ecx = load local1 mov busy free
+ spill eax free busy free
+ call mov free free
+ /* now eax contains the right operand of the shl */
+ mov %eax -> %ecx free busy free
+ un-spill mov busy free
+ shl %cl, %eax mov free free
+
+ What happens when a register that we need to allocate with get_register_force ()
+ contains an operand for the next instruction?
+
+ instruction register status -> [%eax %ecx %edx]
+ eax = load local0 mov free free
+ ecx = load local1 mov mov free
+ get_register_force (ecx) here.
+ We have two options:
+ mov %ecx, %edx
+ or:
+ spill %ecx
+ The first option is way better (and allows the peephole pass to
+ just load the value in %edx directly, instead of loading first to %ecx).
+ This doesn't work, though, if the instruction clobbers the %edx register
+ (like in a thiscall). So, we first need to clobber the registers
+ (so the state of %ecx changes to freeable and there is no issue
+ with get_register_force ()).
+ What if an instruction both clobbers a register and requires it as
+ an operand? Let's take the x86 idiv instruction as an example: it
+ requires the dividend in edx:eax and returns the result in eax,
+ with the modulus in edx.
+
+ store (local1, div (local1, local2))
+
+ instruction register status -> [%eax %ecx %edx]
+ eax = load local0 mov free free
+ will_clobber eax, edx free mov free
+ force mov %ecx, %eax busy free free
+ set %edx busy free busy
+ idiv mov free free
+
+ Note: edx is set to free after idiv, because the modulus is not needed
+ (if it was a rem, eax would have been freed).
+ If we load the divisor before will_clobber(), we'll have to spill
+ eax and reload it later. If we load it just before the idiv, there is no issue.
+ In any case, the algorithm should give the correct results and allow the operation.
+
+ Working recursively on the instructions there shouldn't be huge issues
+ with this algorithm (though, of course, it's not optimal and it may
+ introduce excessive spills or register moves). The advantages over the current
+ local reg allocator are that:
+ 1) the number of spills/moves would be smaller anyway
+ 2) a separate peephole pass could be able to eliminate reg moves
+ 3) we'll be able to remove the 'forced' spills we currently do with
+ the return value of method calls
+
+* Issues
+
+ How to best integrate such a reg allocator with the burg stuff.
+
+ Think about a call on sparc with two arguments: they go into %o0 and %o1
+ and each of them sets its register as busy. But what if the values to put there
+ are themselves the results of calls? %o0 is no problem, but for each
+ subsequent argument n the above algorithm would spill all the 0...n-1 registers...
+
+* Papers
+
+ More complex solutions to the local register allocator problem:
+ http://dimacs.rutgers.edu/TechnicalReports/abstracts/1997/97-33.html
+
+ Combining register allocation and instruction scheduling:
+ http://citeseer.nj.nec.com/motwani95combining.html
+
+ More on LRA heuristics:
+ http://citeseer.nj.nec.com/liberatore97hardness.html
+
+ Linear-time optimal code scheduling for delayed-load architectures
+ http://www.cs.wisc.edu/~fischer/cs701.f01/inst.sched.ps.gz
+
+ Precise Register Allocation for Irregular Architectures
+ http://citeseer.nj.nec.com/kong98precise.html
+
+ Allocate registers first to subtrees that need more of them.
+ http://www.upb.de/cs/ag-kastens/compii/folien/comment401-409.2.pdf
diff --git a/docs/mini-doc.txt b/docs/mini-doc.txt
new file mode 100644
index 00000000000..cfe1a91d365
--- /dev/null
+++ b/docs/mini-doc.txt
@@ -0,0 +1,628 @@
+
+ A new JIT compiler for the Mono Project
+
+ Miguel de Icaza (miguel@{ximian.com,gnome.org}),
+
+
+* Abstract
+
+ Mini is a new compilation engine for the Mono runtime. The
+ new engine is designed to bring new code generation
+ optimizations, portability and precompilation.
+
+ In this document we describe the design decisions and the
+ architecture of the new compilation engine.
+
+* Introduction
+
+ First we discuss the overall architecture of the Mono runtime,
+ and how code generation fits into it; then we discuss the
+ development and basic architecture of our first JIT compiler
+ for the ECMA CIL framework. The next section covers the
+ objectives for the work on the new JIT compiler; then we
+ discuss the new features available in it, and finally we give
+ a technical description of the new code generation engine.
+
+* Architecture of the Mono Runtime
+
+ The Mono runtime is an implementation of the ECMA Common
+ Language Infrastructure (CLI), whose aim is to be a common
+ platform for executing code in multiple languages.
+
+ Languages that target the CLI generate images that contain
+ code in a high-level intermediate representation called the
+ "Common Intermediate Language". This intermediate language is
+ rich enough to allow programs and pre-compiled libraries
+ to be reflected. The execution model is object oriented,
+ with single inheritance and multiple interface
+ implementations.
+
+ This runtime provides a number of services for programs that
+ are targeted to it: Just-in-Time compilation of CIL code into
+ native code, garbage collection, thread management, I/O
+ routines, single, double and decimal floating point,
+ asynchronous method invocation, application domains, and a
+ framework for building arbitrary RPC systems (remoting) and
+ integration with system libraries through the Platform Invoke
+ functionality.
+
+ The focus of this document is on the services provided by the
+ Mono runtime to transform CIL bytecodes into code that is
+ native to the underlying architecture.
+
+ The code generation interface is a set of macros, found in the
+ mono/jit/arch/ directory, that allow a C programmer to generate
+ code on the fly. These macros are used by the JIT compiler to
+ generate native code.
+
+ The platform invocation code is interesting, as it generates
+ CIL code on the fly to marshal parameters, and this
+ code is in turn processed by the JIT engine.
+
+* Previous Experiences
+
+ Mono has built a JIT engine, which has been used to bootstrap
+ Mono since January, 2002. This JIT engine has reasonable
+ performance, and uses a tree-pattern-matching instruction
+ selector based on the BURS technology. This JIT compiler was
+ designed by Dietmar Maurer, Paolo Molaro and Miguel de Icaza.
+
+ The existing JIT compiler has three phases:
+
+ * Re-creation of the semantic tree from CIL
+ byte-codes.
+
+ * Instruction selection, with a cost-driven
+ engine.
+
+ * Code generation and register allocation.
+
+ It is also hooked into the rest of the runtime to provide
+ services like marshaling, just-in-time compilation and
+ invocation of "internal calls".
+
+ This engine constructed a collection of trees, which we
+ referred to as the "forest of trees"; this forest was created by
+ "hydrating" the CIL instruction stream.
+
+ The first step was to identify the basic blocks in the method
+ and to compute the control flow graph (cfg) for it. Once this
+ information was computed, a stack analysis was performed on
+ each basic block to create a forest of trees for each one of
+ them.
+
+ So for example, the following statement:
+
+ int a, b;
+ ...
+ b = a + 1;
+
+ Which would be represented in CIL as:
+
+ ldloc.0
+ ldc.i4.1
+ add
+ stloc.1
+
+ The stack analysis would then create the following tree:
+
+ (STIND_I4 ADDR_L[EBX|2] (
+ ADD (LDIND_I4 ADDR_L[ESI|1])
+ CONST_I4[1]))
+
+ This tree contains information from the stack analysis: for
+ instance, notice that the operations explicitly encode the
+ data types they are operating on; there is no longer any
+ ambiguity about the types, because this information has been
+ inferred.
+
+ At this point the JIT would pass the constructed forest of
+ trees to the architecture-dependent JIT compiler.
+
+ The architecture-dependent code then performed register
+ allocation (optionally using linear scan allocation for
+ variables, based on liveness analysis).
+
+ Once variables had been assigned, tree pattern matching with
+ dynamic programming was used (the tree pattern matcher is
+ custom built for each architecture, using a code
+ generator: monoburg). The instruction selector used cost
+ functions to select the best instruction patterns.
+
+ The instruction selector is able to produce instructions that
+ take advantage of the x86 indexed addressing modes, for
+ example.
+
+ One problem, though, was that the code emitter and the register
+ allocator did not have any visibility outside the current
+ tree, which meant that some redundant instructions were
+ generated. A peephole optimizer with this architecture was
+ hard to write, given the tree-based representation that was
+ used.
+
+ This JIT was functional, but it did not provide a good
+ architecture to base future optimizations on. Also the
+ line between architecture neutral and architecture
+ specific code and optimizations was hard to draw.
+
+ The JIT engine supported two code generation modes for
+ applications that host multiple application domains: generate
+ code that will be shared across application domains, or
+ generate code that will not be shared across application
+ domains.
+
+* Objectives of the new JIT engine.
+
+ We wanted to support a number of features that were missing:
+
+ * Ahead-of-time compilation.
+
+ The idea is to allow developers to pre-compile their code
+ to native code to reduce startup time and the working
+ set that the just-in-time compiler would otherwise use at runtime.
+
+ Although this has not been a visible problem in Mono, we
+ wanted to address it proactively.
+
+ When an assembly (a Mono/.NET executable) is installed in
+ the system, it would then be possible to pre-compile the
+ code, and have the JIT compiler tune the generated code
+ to the particular CPU on which the software is
+ installed.
+
+ This is done in the Microsoft .NET world with a tool
+ called ngen.exe.
+
+ * Have a good platform for doing code optimizations.
+
+ The design called for a good architecture that would
+ enable various levels of optimizations: some
+ optimizations are better performed on high-level
+ intermediate representations, some on medium-level and
+ some at low-level representations.
+
+ Also it should be possible to conditionally turn these on
+ or off. Some optimizations are too expensive to be used
+ in just-in-time compilation scenarios, but these
+ expensive optimizations can be turned on for
+ ahead-of-time compilations or when using profile-guided
+ optimizations on a subset of the executed methods.
+
+ * Reduce the effort required to port the Mono code
+ generator to new architectures.
+
+ For Mono to gain wide adoption in the Unix world, it is
+ necessary that the JIT engine works in most of today's
+ commercial hardware platforms.
+
+* Features of the new JIT engine.
+
+ The new JIT engine was architected by Dietmar Maurer and Paolo
+ Molaro, based on the new objectives.
+
+ Mono provides a number of services to applications running
+ with the new JIT compiler:
+
+ * Just-in-Time compilation of CLI code into native code.
+
+ * Ahead-of-Time compilation of CLI code, to reduce
+ startup time of applications.
+
+ A number of software development features are also available:
+
+ * Execution time profiling (--profile)
+
+ Generates a report of the time consumed by each routine,
+ the number of times each routine was invoked, and the
+ callers.
+
+ * Memory usage profiling (--profile)
+
+ Generates a report of the memory usage by a program
+ that is run under the Mono JIT.
+
+ * Code coverage (--coverage)
+
+ * Execution tracing.
+
+ People who are interested in developing and improving the Mini
+ JIT compiler will also find a few useful routines:
+
+ * Compilation times
+
+ This is used to measure the time the JIT takes to
+ compile a routine.
+
+ * Control Flow Graph and Dominator Tree drawing.
+
+ These are visual aids for the JIT developer: they
+ render representations of the Control Flow graph, and
+ for the more advanced optimizations, they draw the
+ dominator tree graph.
+
+ This requires Dot (from the graphviz package) and Ghostview.
+
+ * Code generator regression tests.
+
+ The engine contains support for running regression
+ tests on the virtual machine, which is very helpful to
+ developers interested in improving the engine.
+
+ * Optimization benchmark framework.
+
+ The JIT engine will generate graphs that compare
+ various benchmarks embedded in an assembly, and run the
+ various tests with different optimization flags.
+
+ This requires Perl and GD::Graph.
+
+* Flexibility
+
+ This is probably the most important component of the new code
+ generation engine. The internals are relatively easy to
+ replace and update, even large passes can be replaced and
+ implemented differently.
+
+* New code generator
+
+ Compiling a method begins with the `mini_method_to_ir' routine
+ that converts the CIL representation into a medium-level
+ intermediate representation.
+
+ The mini_method_to_ir routine performs a number of operations:
+
+ * Flow analysis and control flow graph computation.
+
+ Unlike the previous version, stack analysis and control
+ flow graphs are computed in a single pass in the
+ mini_method_to_ir function. This is done for performance
+ reasons: although the complexity increases, the benefit
+ for a JIT compiler is that more time is left available
+ for performing other optimizations.
+
+ * Basic block computation.
+
+ mini_method_to_ir populates the MonoCompile structure
+ with an array of basic blocks, each of which contains a
+ forest of trees made up of MonoInst structures.
+
+ * Inlining
+
+ Inlining is no longer restricted to methods containing
+ a single basic block; instead it is possible to inline
+ arbitrarily complex methods.
+
+ The heuristics used to choose what to inline are likely
+ to be tuned in the future.
+
+ * Method to opcode conversion.
+
+ Some method call invocations like `call Math.Sin' are
+ transformed into an opcode: this transforms the call
+ into a semantically rich node, which is later inlined
+ into an FPU instruction.
+
+ Various Array method invocations are turned into
+ opcodes as well (the Get, Set and Address methods).
+
+ * Tail recursion elimination
+
+** Basic blocks
+
+ The MonoInst structure holds the actual decoded instruction,
+ with the semantic information from the stack analysis.
+ MonoInst is interesting because initially it is part of a tree
+ structure; here is a sample of the same tree in the new JIT
+ engine:
+
+ (stind.i4 regoffset[0xffffffd4(%ebp)]
+ (add (ldind.i4 regoffset[0xffffffd8(%ebp)])
+ iconst[1]))
+
+ This is a medium-level intermediate representation (MIR).
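+
+ As a rough illustration of what such a tree node carries (the structure and
+ field names below are made up for this sketch and do not claim to match the
+ real MonoInst definition):
+
+      /* Illustrative sketch only; not the actual MonoInst layout. */
+      typedef struct _MirInst MirInst;
+
+      struct _MirInst {
+              int      opcode;      /* e.g. add, ldind.i4, iconst */
+              int      stack_type;  /* type inferred by the stack analysis */
+              MirInst *left;        /* operand trees, NULL for leaves */
+              MirInst *right;
+              long     const_val;   /* payload for constants */
+              int      reg_offset;  /* payload for register/stack slots */
+      };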
+
+ Some complex opcodes are decomposed at this stage into a
+ collection of simpler opcodes. Not every complex opcode is
+ decomposed at this stage, as we need to preserve the semantic
+ information during various optimization phases.
+
+ For example, a NEWARR opcode carries the length and the type of
+ the array, which could be used later to avoid type checks or
+ array bounds checks.
+
+ There are a number of operations supported on this
+ representation:
+
+ * Branch optimizations.
+
+ * Variable liveness.
+
+ * Loop optimizations: the dominator trees are
+ computed, loops are detected, and their nesting
+ level computed.
+
+ * Conversion of the method into static single assignment
+ form (SSA form).
+
+ * Dead code elimination.
+
+ * Constant propagation.
+
+ * Copy propagation.
+
+ * Constant folding.
+
+ Once the above optimizations are optionally performed, a
+ decomposition phase is used to turn some complex opcodes into
+ internal method calls. In the initial version of the JIT
+ engine, various operations on longs are emulated instead of
+ being inlined. Also the newarr invocation is turned into a
+ call to the runtime.
+
+ At this point, after computing variable liveness, it is
+ possible to use the linear scan algorithm for allocating
+ variables to registers. The linear scan pass uses the
+ information that was previously gathered by the loop nesting
+ and loop structure computation to favor variables in inner
+ loops.
+
+ Stack space is then reserved for the local variables and any
+ temporary variables generated during the various
+ optimizations.
+
+** Instruction selection
+
+ At this point, the BURS instruction selector is invoked to
+ transform the tree-based representation into a list of
+ instructions. This is done using a tree pattern matcher that
+ is generated for the architecture using the `monoburg' tool.
+
+ Monoburg takes as input a file that describes tree patterns,
+ which are matched against the trees that were produced by the
+ engine in the previous stages.
+
+ The pattern matching might have more than one match for a
+ particular tree. In this case, the match selected is the one
+ whose cost is the smallest. A cost can be attached to each
+ rule, and if no cost is provided, the implicit cost is one.
+ Smaller costs are selected over higher costs.
+
+ The cost function can be used to select particular blocks of
+ code for a given architecture, or, by using a prohibitively
+ high cost, to prevent a rule from matching.
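+
+ To make the cost-driven selection concrete, here is a small self-contained
+ toy in the spirit of BURS labeling (this is not the monoburg implementation):
+ every rule that matches a node is considered, and the cheapest one, including
+ the cost of labeling the children, is recorded.
+
+      #include <limits.h>
+
+      enum { OP_CONST, OP_LOAD, OP_ADD };
+
+      typedef struct Node {
+              int opcode;
+              struct Node *left, *right;
+              int cost;     /* minimal cost found so far */
+              int rule;     /* index of the rule that achieved it */
+      } Node;
+
+      static const struct { int opcode, cost; } rules [] = {
+              { OP_CONST, 1 },  /* 0: load an immediate into a register */
+              { OP_LOAD,  1 },  /* 1: load from memory */
+              { OP_ADD,   1 },  /* 2: add two registers */
+              { OP_ADD,   0 },  /* 3: fold the add into an addressing mode */
+      };
+
+      static void
+      label_tree (Node *n)
+      {
+              if (!n)
+                      return;
+              label_tree (n->left);
+              label_tree (n->right);
+              n->cost = INT_MAX;
+              for (int i = 0; i < 4; i++) {
+                      if (rules [i].opcode != n->opcode)
+                              continue;
+                      int c = rules [i].cost
+                              + (n->left ? n->left->cost : 0)
+                              + (n->right ? n->right->cost : 0);
+                      if (c < n->cost) {
+                              n->cost = c;
+                              n->rule = i;
+                      }
+              }
+      }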
+
+ The various rules that our JIT engine uses transform a tree of
+ MonoInsts into a list of monoinsts:
+
+ +-----------------------------------------------------------+
+ | Tree List |
+ | of ===> Instruction selection ===> of |
+ | MonoInst MonoInst. |
+ +-----------------------------------------------------------+
+
+ During this process various "kinds" of MonoInst
+ disappear and are turned into lower-level representations. The
+ JIT compiler just happens to reuse the same structure (this is
+ done to reduce memory usage and improve memory locality).
+
+ The instruction selection rules are split in a number of
+ files, each one with a particular purpose:
+
+ inssel.brg
+ Contains the generic instruction selection
+ patterns.
+
+ inssel-x86.brg
+ Contains x86 specific rules.
+
+ inssel-ppc.brg
+ Contains PowerPC specific rules.
+
+ inssel-long32.brg
+ burg file for 64bit instructions on 32bit architectures.
+
+ inssel-long.brg
+ burg file for 64bit architectures.
+
+ inssel-float.brg
+ burg file for floating point instructions
+
+ For a given build, a set of those files would be included.
+ For example, for the build of Mono on the x86, the following
+ set is used:
+
+ inssel.brg inssel-x86.brg inssel-long32.brg inssel-float.brg
+
+** Native method generation
+
+ The native method generation has a number of steps:
+
+ * Architecture specific register allocation.
+
+ The information about loop nesting that was
+ previously gathered is used here to hint the
+ register allocator.
+
+ * Generating the method prolog/epilog.
+
+ * Optionally generate code to introduce tracing facilities.
+
+ * Hooking into the debugger.
+
+ * Performing any pending fixups.
+
+ * Code generation.
+
+*** Code Generation
+
+ The actual code generation is contained in the architecture
+ specific portion of the compiler. The input to the code
+ generator is each one of the basic blocks with its list of
+ instructions that were produced in the instruction selection
+ phase.
+
+ During the instruction selection phase, virtual registers are
+ assigned. Just before the peephole optimization is performed,
+ physical registers are assigned.
+
+ A simple peephole and algebraic optimizer is run at this
+ stage.
+
+ The peephole optimizer removes some redundant operations at
+ this point. This is possible because the code generation at
+ this point has visibility into the basic block that spans the
+ original trees.
+
+ The algebraic optimizer performs some simple algebraic
+ optimizations that replace expensive operations with cheaper
+ operations if possible.
+
+ The rest of the code generation is fairly simple: a switch
+ statement is used to generate code for each of the MonoInsts.
+
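+ A minimal, self-contained sketch of that idea (not the actual Mono emitter;
+ the opcode names and fields here are invented for the example):
+
+      #include <stdio.h>
+
+      enum { OP_ICONST, OP_IADD };
+
+      typedef struct Ins {
+              int opcode;
+              int dreg, sreg1, sreg2;
+              long imm;
+              struct Ins *next;
+      } Ins;
+
+      /* Walk the instruction list of a basic block and emit one piece of
+       * native code (here just printed as text) per opcode. */
+      static void
+      emit_basic_block (Ins *first)
+      {
+              Ins *ins;
+
+              for (ins = first; ins; ins = ins->next) {
+                      switch (ins->opcode) {
+                      case OP_ICONST:
+                              printf ("\tmov $%ld, r%d\n", ins->imm, ins->dreg);
+                              break;
+                      case OP_IADD:
+                              printf ("\tadd r%d, r%d\n", ins->sreg2, ins->dreg);
+                              break;
+                      default:
+                              fprintf (stderr, "unhandled opcode %d\n", ins->opcode);
+                      }
+              }
+      }
+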
+ We always try to allocate code in sequence, instead of just using
+ malloc. This way we increase spatial locality which gives a massive
+ speedup on most architectures.
+
+*** Ahead of Time compilation
+
+ Ahead-of-Time compilation is a new feature of our new
+ compilation engine. The compilation engine is shared by the
+ Just-in-Time (JIT) compiler and the Ahead-of-Time compiler
+ (AOT).
+
+ The difference is in the set of optimizations that are turned
+ on for each mode: Just-in-Time compilation should be as fast
+ as possible, while Ahead-of-Time compilation can take as long
+ as required, because it is not performed at a time-critical
+ moment.
+
+ With AOT compilation, we can afford to turn all of the
+ computationally expensive optimizations on.
+
+ After the code generation phase is done, the code and any
+ required fixup information is saved into a file that is
+ readable by "as" (the native assembler available on all
+ systems). This assembly file is then passed to the native
+ assembler, which generates a loadable module.
+
+ At execution time, when an assembly is loaded from the disk,
+ the runtime engine will probe for the existence of a
+ pre-compiled image. If the pre-compiled image exists, then it
+ is loaded, and the method invocations are resolved to the code
+ contained in the loaded module.
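+
+ A minimal sketch of that probing step, assuming only the naming convention
+ described earlier (appending ".so" to the assembly file name); this is
+ illustrative and not the actual runtime code:
+
+      #include <dlfcn.h>
+      #include <stdio.h>
+      #include <stdlib.h>
+      #include <string.h>
+
+      /* Illustrative only: try to load "basic.exe.so" next to "basic.exe".
+       * A NULL return means no AOT image exists and the JIT path is used. */
+      static void *
+      probe_aot_image (const char *assembly_path)
+      {
+              char *aot_name;
+              void *module;
+
+              aot_name = malloc (strlen (assembly_path) + 4);
+              if (!aot_name)
+                      return NULL;
+              sprintf (aot_name, "%s.so", assembly_path);
+              module = dlopen (aot_name, RTLD_LAZY);
+              free (aot_name);
+              return module;
+      }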
+
+ The code generated under the AOT scenario is slightly
+ different from the JIT scenario: the generated code is
+ application-domain relative and can be shared among
+ multiple threads.
+
+ This is the same code generation that is used when the runtime
+ is instructed to maximize code sharing on a multi-application
+ domain scenario.
+
+* SSA-based optimizations
+
+ SSA form simplifies many optimizations because each variable has exactly
+ one definition site. All uses of a variable are "dominated" by its
+ definition, which enables us to implement algorithms like:
+
+ * conditional constant propagation
+
+ * array bound check removal
+
+ * dead code elimination
+
+ And we can implement those algorithms in an efficient way using SSA.
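+
+ For a tiny illustration of why the single definition site helps, consider
+ this renaming (written as pseudocode; "phi" denotes the SSA merge of the
+ two incoming definitions):
+
+      before SSA:              after SSA renaming:
+        x = 1;                   x1 = 1;
+        if (p)                   if (p)
+            x = 2;                   x2 = 2;
+        y = x + 3;               x3 = phi (x1, x2);
+                                 y  = x3 + 3;
+
+ Conditional constant propagation, for instance, only has to look at the
+ single definitions of x1 and x2 to know which constants can reach y.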
+
+
+* Register allocation.
+
+ Global register allocation is performed on the medium
+ intermediate representation just before instruction selection
+ is performed on the method. Local register allocation is
+ later performed at the basic-block level on the lower-level
+ representation produced by instruction selection.
+
+ Global register allocation uses the following input:
+
+ 1) the set of register-sized variables that can be allocated to a
+ register (this is an architecture-specific setting; for x86
+ these registers are the callee-saved registers ESI, EDI and
+ EBX).
+
+ 2) liveness information for the variables
+
+ 3) (optionally) loop info to favour variables that are used in
+ inner loops.
+
+ During the instruction selection phase, symbolic registers are
+ assigned to temporary values in expressions.
+
+ Local register allocation assigns hard registers to the
+ symbolic registers; it is performed at the basic-block level,
+ just before the code is actually emitted.
+ A CPU description file describes the input registers, output
+ registers, fixed registers and clobbered registers of each
+ operation.
+
+
+----------
+* Bootstrap
+
+ The Mini bootstrap parses the arguments passed on the command
+ line, and initializes the JIT runtime. Each time the
+ mini_init() routine is invoked, a new Application Domain will
+ be returned.
+
+* Signal handlers
+
+ mono_runtime_install_handlers
+
+* BURG Code Generator Generator
+
+ monoburg was written by Dietmar Maurer. It is based on the
+ papers from Christopher W. Fraser, Robert R. Henry and Todd
+ A. Proebsting: "BURG - Fast Optimal Instruction Selection and
+ Tree Parsing" and "Engineering a Simple, Efficient Code
+ Generator Generator".
+
+ The original BURG implementation is unable to work on DAGs; only
+ trees are allowed. Our monoburg implementation is able to generate a tree
+ matcher which works on DAGs, and we use this feature in the new
+ JIT. This simplifies the code because we can directly pass DAGs and
+ don't need to convert them to trees.
+
+* Future
+
+ Profile-based optimization is something that we are very
+ interested in supporting. There are two possible usage
+ scenarios:
+
+ * Based on the profile information gathered during
+ the execution of a program, hot methods can be compiled
+ with the highest level of optimizations, while bootstrap
+ code and cold methods can be compiled with the least set
+ of optimizations and placed in a discardable list.
+
+ * Code reordering: this profile-based optimization would
+ only make sense for pre-compiled code. The profile
+ information is used to re-order the assembly code on disk
+ so that the code is placed in a way that
+ increases locality.
+
+ This is the same principle under which SGI's cord program
+ works.
+
+ The nature of the CIL makes the above optimizations
+ easy to implement and deploy. Since we define our own
+ universe for these things, no interaction with
+ system tools is required, nor are upgrades to the underlying
+ infrastructure.
+
+ Instruction scheduling is important for certain kinds of
+ processors, and some of the framework to cope with it exists
+ today in our register allocator and the instruction selector,
+ but it has not been finished. The instruction scheduling
+ would happen at the same time as local register allocation.
\ No newline at end of file
diff --git a/docs/opcode-decomp.txt b/docs/opcode-decomp.txt
new file mode 100644
index 00000000000..48968d17ab9
--- /dev/null
+++ b/docs/opcode-decomp.txt
@@ -0,0 +1,113 @@
+
+* How to handle complex IL opcodes in an arch-independent way
+
+ Many IL opcodes are very simple: add, ldind etc.
+ Such opcodes can be implemented with a single cpu instruction
+ in most architectures (on some, a group of IL instructions
+ can be converted to a single cpu op).
+ There are many IL opcodes, though, that are more complex, but
+ can be expressed as a series of trees or a single tree of
+ simple operations. Such simple operations are architecture-independent.
+ It makes sense to decompose such complex IL instructions into their
+ simpler equivalents so that we gain in several ways:
+ *) porting effort is easier, because only the simple instructions
+ need to be implemented in arch-specific code
+ *) we could apply BURG rules to the trees and do pattern matching
+ on them to optimize the expressions according to the host cpu
+
+ The issue is: where do we do such conversion from coarse opcodes to
+ simple expressions?
+
+* Doing the conversion in method_to_ir ()
+
+ Some of these conversions can certainly be done in method_to_ir (),
+ but it's not always easy to decide which are better done there and
+ which in a different pass.
+ For example, let's take ldlen: in the mono implementation, ldlen
+ can be simply implemented with a load from a fixed position in the
+ array object:
+
+ len = [reg + maxlen_offset]
+
+ However, ldlen also carries semantic information: the result is the
+ length of the array, and since in the CLR arrays are of fixed size,
+ this information can be useful later to do bounds check removal.
+ If we convert this opcode in method_to_ir () we lose some useful
+ information for further optimizations.
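+
+ In C terms the decomposed form amounts to something like the sketch below
+ (the struct layout and field names are illustrative only, not the actual
+ MonoArray definition):
+
+      /* Illustrative only: the array length lives at a fixed offset inside
+       * the array object, so ldlen becomes a single load. */
+      typedef struct {
+              void *vtable;            /* object header */
+              void *bounds;            /* multi-dimensional info, if any */
+              unsigned int max_length; /* "maxlen_offset" points here */
+      } ToyArray;
+
+      #define LDLEN(arr) (((ToyArray *) (arr))->max_length)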
+
+ In other cases, decomposing an opcode in method_to_ir () may
+ allow for better optimizations later on (need to come up with an
+ example here ...).
+
+* Doing the conversion in inssel.brg
+
+ Some conversions may be done inside the burg rules: this has the
+ disadvantage that the instruction selector is not run again on
+ the resulting expression tree, so we could miss some optimizations
+ (this is effectively what happens with the coarse opcodes in the old
+ jit). It may also interfere with an efficient local register allocator.
+ It may be possible to add an extension in monoburg that allows a rule
+ such as:
+
+ recheck: LDLEN (reg) {
+ create an expression tree representing LDLEN
+ and return it
+ }
+
+ When the monoburg label process gets back a recheck, it will run
+ the labeling again on the resulting expression tree.
+ Whether this is possible at all (and in an efficient way) is a
+ question for dietmar:-)
+ It should be noted, though, that this may not always work, since
+ some complex IL opcodes may require a series of expression trees
+ and handling such cases in monoburg could become quite hairy.
+ For example, think of opcodes that need to do multiple actions on the
+ same object: this basically means a DUP...
+ On the other hand, if a complex opcode needs a DUP, monoburg doesn't
+ actually need to create trees if it emits the instructions in
+ the correct sequence and maintains the right values in the registers
+ (usually the values that need a DUP are not changed...). How
+ this integrates with the current register allocator is not clear, since
+ it assigns registers based on the rule, but the instructions emitted
+ by the rules may be different (this already happens with the current JIT
+ where a MULT is replaced with lea etc...).
+
+* Doing it in a separate pass.
+
+ Doing the conversion in a separate pass over the instructions
+ is another alternative. This can be done right after method_to_ir ()
+ or after the SSA pass (since the IR after the SSA pass should look
+ almost like the IR we get back from method_to_ir ()).
+
+ This has the following advantages:
+ *) monoburg will handle only the simple opcodes (makes porting easier)
+ *) the instruction selection will be run on all the additional trees
+ *) it's easier to support coarse opcodes that produce multiple expression
+ trees (and apply the monoburg selector on all of them)
+ *) the SSA optimizer will see the original opcodes and will be able to use
+ the semantic info associated with them
+
+ The disadvantage is that this is a separate pass on the code and
+ it takes time (how much has not been measured yet, though).
+
+ With this approach, we may also be able to have C implementations
+ of some of the opcodes: this pass would insert a function call to
+ the C implementation (for example, when first porting to a new arch,
+ implementing some opcodes in asm may be too hard).
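+
+ As an invented example only (not an existing Mono helper), a 64-bit shift
+ on a 32-bit target could initially be routed through a plain C function by
+ this pass:
+
+      /* Hypothetical C fallback the decompose pass could call instead of
+       * emitting inline code on a brand-new port. */
+      static long long
+      helper_shl_i8 (long long value, int amount)
+      {
+              return value << (amount & 63);
+      }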
+
+* Extended basic blocks
+
+ IL code needs a lot of checks: bounds checks, overflow checks,
+ type checks and so on. This can greatly increase
+ the number of basic blocks in a control flow graph. However,
+ all such blocks end up with a throw opcode that gives control to the
+ exception handling mechanism.
+ After method_to_ir () a MonoBasicBlock can be considered a sort
+ of extended basic block where the additional exits don't point
+ to basic blocks in the same procedure (at least when the method
+ doesn't have exception tables).
+ We need to make sure the passes following method_to_ir () can cope
+ with such kinds of extended basic blocks (especially the passes
+ that we need to apply to all the methods: as a start, we could
+ skip SSA optimizations for methods with exception clauses...)
+