diff --git a/docs/mini-doc.txt b/docs/mini-doc.txt
new file mode 100644
index 00000000000..bd4b489777e
--- /dev/null
+++ b/docs/mini-doc.txt
@@ -0,0 +1,771 @@
+
+ A new JIT compiler for the Mono Project
+
+ Miguel de Icaza (miguel@{ximian.com,gnome.org}),
+ Paolo Molaro (lupus@{ximian.com,debian.org})
+
+
+* Abstract
+
+ Mini is a new compilation engine for the Mono runtime. The
+ new engine is designed to bring new code generation
+ optimizations, portability and pre-compilation.
+
+ In this document we describe the design decisions and the
+ architecture of the new compilation engine.
+
+* Introduction
+
+ Mono is an Open Source implementation of the .NET Framework: it
+ is made up of a runtime engine that implements the ECMA Common
+ Language Infrastructure (CLI), a set of compilers that target
+ the CLI and a large collection of class libraries.
+
+ This article discusses the new code generation facilities that
+ have been added to the Mono runtime.
+
+ First we discuss the overall architecture of the Mono runtime,
+ and how code generation fits into it; then we discuss the
+ development and basic architecture of our first JIT compiler
+ for the ECMA CIL framework. The next sections cover the
+ objectives for the work on the new JIT compiler, the new
+ features it makes available, and finally a technical
+ description of the new code generation engine.
+
+* Architecture of the Mono Runtime
+
+ The Mono runtime is an implementation of the ECMA Common
+ Language Infrastructure (CLI), whose aim is to be a common
+ platform for executing code in multiple languages.
+
+ Languages that target the CLI generate images that contain
+ code in a high-level intermediate representation called the
+ "Common Intermediate Language". This intermediate language is
+ rich enough to allow programs and pre-compiled libraries to
+ be reflected. The execution model is object oriented, with
+ single inheritance and multiple interface implementations.
+
+ This runtime provides a number of services for programs that
+ are targeted to it: Just-in-Time compilation of CIL code into
+ native code, garbage collection, thread management, I/O
+ routines, single, double and decimal floating point,
+ asynchronous method invocation, application domains, and a
+ framework for building arbitrary RPC systems (remoting) and
+ integration with system libraries through the Platform Invoke
+ functionality.
+
+ The focus of this document is on the services provided by the
+ Mono runtime to transform CIL bytecodes into code that is
+ native to the underlying architecture.
+
+ The code generation interface is a set of macros that allow a
+ C programmer to generate code on the fly; the macros are
+ found in the mono/jit/arch/ directory and are used by the JIT
+ compiler to generate native code.
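+
+ To give a feel for this interface, here is a small,
+ self-contained sketch in the same style. The x86_* macros
+ below are simplified stand-ins written for this example; they
+ are not the actual macros from the mono headers:
+
+    /* Simplified stand-ins for the real x86-codegen.h macros: each
+     * one appends bytes at the code cursor and advances it. */
+    #include <stdio.h>
+    #include <string.h>
+    #include <sys/mman.h>
+
+    #define x86_mov_eax_imm(code,imm) do { \
+            *(code)++ = 0xb8; \
+            memcpy ((code), &(imm), 4); (code) += 4; \
+    } while (0)
+    #define x86_ret(code) do { *(code)++ = 0xc3; } while (0)
+
+    int
+    main (void)
+    {
+            /* needs a system that allows writable+executable mappings */
+            unsigned char *buf = mmap (NULL, 4096,
+                    PROT_READ | PROT_WRITE | PROT_EXEC,
+                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+            unsigned char *code = buf;
+            int answer = 42;
+
+            x86_mov_eax_imm (code, answer);  /* mov eax, 42 */
+            x86_ret (code);                  /* ret         */
+
+            printf ("%d\n", ((int (*) (void)) buf) ());  /* prints 42 */
+            return 0;
+    }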
+
+ The platform invocation code is interesting, as it generates
+ CIL code on the fly to marshal parameters; this code is in
+ turn processed by the JIT engine.
+
+* Previous Experiences
+
+ Mono has built a JIT engine, which has been used to bootstrap
+ Mono since January 2002. This JIT engine has reasonable
+ performance, and uses a tree-pattern-matching instruction
+ selector based on the BURS technology. This JIT compiler was
+ designed by Dietmar Maurer, Paolo Molaro and Miguel de Icaza.
+
+ The existing JIT compiler has three phases:
+
+ * Re-creation of the semantic tree from CIL
+ byte-codes.
+
+ * Instruction selection, with a cost-driven
+ engine.
+
+ * Code generation and register allocation.
+
+ It is also hooked into the rest of the runtime to provide
+ services like marshaling, just-in-time compilation and
+ invocation of "internal calls".
+
+ This engine constructed a collection of trees, which we
+ referred to as the "forest of trees"; this forest is created
+ by "hydrating" the CIL instruction stream.
+
+ The first step was to identify the basic blocks in the
+ method, and to compute the control flow graph (cfg) for it.
+ Once this information was computed, a stack analysis was
+ performed on each basic block to create a forest of trees for
+ each one of them.
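+
+ The first of these steps can be sketched with the classic
+ "leader" algorithm: the method entry, every branch target, and
+ every instruction following a branch starts a new basic block.
+ The toy bytecode encoding below is invented for the example
+ and is not real CIL:
+
+    #include <stdio.h>
+
+    /* invented encoding: branches carry a one-byte target offset */
+    enum { OP_NOP, OP_BR, OP_BRTRUE, OP_RET };
+
+    static int
+    is_branch (int op)
+    {
+            return op == OP_BR || op == OP_BRTRUE;
+    }
+
+    int
+    main (void)
+    {
+            unsigned char code [] = { OP_NOP, OP_BRTRUE, 5, OP_NOP, OP_NOP, OP_RET };
+            int leader [sizeof (code)] = { 0 };
+            unsigned int i;
+
+            leader [0] = 1;                          /* method entry */
+            for (i = 0; i < sizeof (code); ++i) {
+                    if (is_branch (code [i])) {
+                            leader [code [i + 1]] = 1;   /* branch target */
+                            if (i + 2 < sizeof (code))
+                                    leader [i + 2] = 1;  /* fall-through */
+                            ++i;                         /* skip the operand */
+                    }
+            }
+            for (i = 0; i < sizeof (code); ++i)
+                    if (leader [i])
+                            printf ("basic block starts at offset %u\n", i);
+            return 0;
+    }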
+
+ So for example, the following statement:
+
+ int a, b;
+ ...
+ b = a + 1;
+
+ which would be represented in CIL as:
+
+ ldloc.0
+ ldc.i4.1
+ add
+ stloc.1
+
+ The stack analysis would then create the following tree:
+
+ (STIND_I4 ADDR_L[EBX|2] (
+ ADD (LDIND_I4 ADDR_L[ESI|1])
+ CONST_I4[1]))
+
+ This tree contains information from the stack analysis:
+ notice, for instance, that the operations explicitly encode
+ the data types they operate on. There is no longer any
+ ambiguity about the types, because this information has been
+ inferred.
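+
+ The "hydration" step itself can be sketched as follows: loads
+ push leaf nodes on an evaluation stack, an add pops two
+ subtrees and pushes a new interior node, and a store consumes
+ the final tree. The Node type here is a simplified stand-in
+ for the real MonoInst:
+
+    #include <stdio.h>
+    #include <stdlib.h>
+
+    typedef struct _Node {
+            const char *op;            /* "ldloc", "const", "add", ... */
+            int value;                 /* local slot or constant value */
+            struct _Node *left, *right;
+    } Node;
+
+    static Node *stack [32];
+    static int sp;
+
+    static Node *
+    node (const char *op, int value, Node *left, Node *right)
+    {
+            Node *n = calloc (1, sizeof (Node));
+            n->op = op; n->value = value; n->left = left; n->right = right;
+            return n;
+    }
+
+    static void
+    dump (Node *n)
+    {
+            if (!n)
+                    return;
+            printf ("(%s[%d] ", n->op, n->value);
+            dump (n->left);
+            dump (n->right);
+            printf (") ");
+    }
+
+    int
+    main (void)
+    {
+            Node *add;
+
+            stack [sp++] = node ("ldloc", 0, NULL, NULL);      /* ldloc.0  */
+            stack [sp++] = node ("const", 1, NULL, NULL);      /* ldc.i4.1 */
+            add = node ("add", 0, stack [sp - 2], stack [sp - 1]);
+            sp -= 2; stack [sp++] = add;                       /* add      */
+            dump (node ("stloc", 1, stack [--sp], NULL));      /* stloc.1  */
+            printf ("\n");
+            return 0;
+    }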
+
+ At this point the JIT would pass the constructed forest of
+ trees to the architecture-dependent JIT compiler.
+
+ The architecture-dependent code then performed register
+ allocation (optionally using linear scan allocation for
+ variables, based on liveness analysis).
+
+ Once variables had been assigned, tree pattern matching with
+ dynamic programming was used (the tree pattern matcher is
+ custom built for each architecture, using a code generator:
+ monoburg). The instruction selector used cost functions to
+ select the best instruction patterns.
+
+ The instruction selector is able to produce instructions that
+ take advantage of the x86 indexed addressing modes, for
+ example.
+
+ One problem, though, was that the code emitter and the
+ register allocator did not have any visibility outside the
+ current tree, which meant that some redundant instructions
+ were generated. A peephole optimizer was hard to write for
+ this architecture, given the tree-based representation used.
+
+ This JIT was functional, but it did not provide a good
+ architecture on which to base future optimizations. Also, the
+ line between architecture-neutral and architecture-specific
+ code and optimizations was hard to draw.
+
+ For applications that host multiple application domains, the
+ JIT engine supported two code generation modes: generating
+ code that will be shared across application domains, or
+ generating code that is private to one application domain.
+
+* Objectives of the new JIT engine.
+
+ We wanted to support a number of features that were missing:
+
+ * Ahead-of-time compilation.
+
+ The idea is to allow developers to pre-compile their code to
+ native code, to reduce startup time and the working set used
+ at runtime by the just-in-time compiler.
+
+ Although in Mono this has not been a visible problem, we
+ wanted to address it proactively.
+
+ When an assembly (a Mono/.NET executable) is installed on
+ the system, it would then be possible to pre-compile the
+ code, and have the JIT compiler tune the generated code to
+ the particular CPU on which the software is installed.
+
+ This is done in the Microsoft .NET world with a tool called
+ ngen.exe.
+
+ * Have a good platform for doing code optimizations.
+
+ The design called for a good architecture that would
+ enable various levels of optimizations: some
+ optimizations are better performed on high-level
+ intermediate representations, some on medium-level and
+ some at low-level representations.
+
+ It should also be possible to conditionally turn these on or
+ off. Some optimizations are too expensive to be used in
+ just-in-time compilation scenarios, but these expensive
+ optimizations can be turned on for ahead-of-time compilation,
+ or when using profile-guided optimization on a subset of the
+ executed methods.
+
+ * Reduce the effort required to port the Mono code
+ generator to new architectures.
+
+ For Mono to gain wide adoption in the Unix world, it is
+ necessary that the JIT engine work on most of today's
+ commercial hardware platforms.
+
+* Features of the new JIT engine.
+
+ The new JIT engine was architected by Dietmar Maurer and Paolo
+ Molaro, based on the new objectives.
+
+ Mono provides a number of services to applications running
+ with the new JIT compiler:
+
+ * Just-in-Time compilation of CLI code into native code.
+
+ * Ahead-of-Time compilation of CLI code, to reduce
+ startup time of applications.
+
+ A number of software development features are also available:
+
+ * Execution time profiling (--profile)
+
+ Generates a report of the time consumed by routines,
+ as well as the invocation times and the callers.
+
+ * Memory usage profiling (--profile)
+
+ Generates a report of the memory usage by a program
+ that is run under the Mono JIT.
+
+ * Code coverage (--coverage)
+
+ * Execution tracing.
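+
+ For instance, assuming an assembly named program.exe (a
+ placeholder), the profiling and coverage facilities above are
+ enabled from the command line:
+
+    mono --profile program.exe
+    mono --coverage program.exe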
+
+ People who are interested in developing and improving the Mini
+ JIT compiler will also find a few useful facilities:
+
+ * Compilation times
+
+ This is used to measure the time spent by the JIT
+ when compiling a routine.
+
+ * Control Flow Graph and Dominator Tree drawing.
+
+ These are visual aids for the JIT developer: they
+ render representations of the Control Flow graph, and
+ for the more advanced optimizations, they draw the
+ dominator tree graph.
+
+ This requires Dot (from the graphviz package) and Ghostview.
+
+ * Code generator regression tests.
+
+ The engine contains support for running regression
+ tests on the virtual machine, which is very helpful to
+ developers interested in improving the engine.
+
+ * Optimization benchmark framework.
+
+ The JIT engine will generate graphs that compare
+ various benchmarks embedded in an assembly, and run the
+ various tests with different optimization flags.
+
+ This requires Perl and GD::Graph.
+
+* Flexibility
+
+ This is probably the most important component of the new code
+ generation engine. The internals are relatively easy to
+ replace and update; even large passes can be replaced and
+ implemented differently.
+
+* New code generator
+
+ Compiling a method begins with the `mini_method_to_ir' routine
+ that converts the CIL representation into a medium-level
+ intermediate representation.
+
+ The mini_method_to_ir routine performs a number of operations:
+
+ * Flow analysis and control flow graph computation.
+
+ Unlike the previous version, stack analysis and control
+ flow graphs are computed in a single pass in the
+ mini_method_to_ir function. This is done for performance
+ reasons: although the complexity increases, the benefit for a
+ JIT compiler is that there is more time available for
+ performing other optimizations.
+
+ * Basic block computation.
+
+ mini_method_to_ir populates the MonoCompile structure
+ with an array of basic blocks, each of which contains a
+ forest of trees made up of MonoInst structures.
+
+ * Inlining
+
+ Inlining is no longer restricted to methods containing
+ a single basic block; instead it is possible to inline
+ arbitrarily complex methods.
+
+ The heuristics used to choose what to inline are likely
+ to be tuned in the future.
+
+ * Method to opcode conversion.
+
+ Some method call invocations, like `call Math.Sin', are
+ transformed into an opcode: this turns the call into a
+ semantically rich node, which is later inlined as an
+ FPU instruction.
+
+ Various Array method invocations are turned into
+ opcodes as well (the Get, Set and Address methods).
+
+ * Tail recursion elimination
+
+** Basic blocks
+
+ The MonoInst structure holds the actual decoded instruction,
+ with the semantic information from the stack analysis.
+ MonoInst is interesting because initially it is part of a
+ tree structure. Here is a sample of the same tree shown
+ earlier, with the new JIT engine:
+
+ (stind.i4 regoffset[0xffffffd4(%ebp)]
+ (add (ldind.i4 regoffset[0xffffffd8(%ebp)])
+ iconst[1]))
+
+ This is a medium-level intermediate representation (MIR).
+
+ Some complex opcodes are decomposed at this stage into a
+ collection of simpler opcodes. Not every complex opcode is
+ decomposed at this stage, as we need to preserve the semantic
+ information during the various optimization phases.
+
+ For example, a NEWARR opcode carries the length and the type
+ of the array, which can be used later to avoid type checks
+ or array bounds checks.
+
+ There are a number of operations supported on this
+ representation:
+
+ * Branch optimizations.
+
+ * Variable liveness.
+
+ * Loop optimizations: the dominator trees are
+ computed, loops are detected, and their nesting
+ level computed.
+
+ * Conversion of the method into static single assignment
+ form (SSA form).
+
+ * Dead code elimination.
+
+ * Constant propagation.
+
+ * Copy propagation.
+
+ * Constant folding.
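+
+ As an illustration of one of the simpler passes above, here
+ is a self-contained sketch of constant folding over a tree of
+ binary operations; the node layout is a simplified stand-in
+ for the real MonoInst:
+
+    #include <stdio.h>
+
+    enum { OP_CONST, OP_ADD, OP_MUL };
+
+    typedef struct _Node {
+            int op, value;
+            struct _Node *left, *right;
+    } Node;
+
+    static void
+    fold (Node *n)
+    {
+            if (!n || n->op == OP_CONST)
+                    return;
+            fold (n->left);
+            fold (n->right);
+            if (n->left->op == OP_CONST && n->right->op == OP_CONST) {
+                    n->value = n->op == OP_ADD ?
+                            n->left->value + n->right->value :
+                            n->left->value * n->right->value;
+                    n->op = OP_CONST;   /* the subtree collapses to a leaf */
+            }
+    }
+
+    int
+    main (void)
+    {
+            Node two = { OP_CONST, 2 }, three = { OP_CONST, 3 };
+            Node four = { OP_CONST, 4 };
+            Node mul = { OP_MUL, 0, &three, &four };   /* 3 * 4       */
+            Node add = { OP_ADD, 0, &two, &mul };      /* 2 + (3 * 4) */
+
+            fold (&add);
+            printf ("%d\n", add.value);                /* prints 14 */
+            return 0;
+    }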
+
+ Once the above optimizations are optionally performed, a
+ decomposition phase is used to turn some complex opcodes into
+ internal method calls. In the initial version of the JIT
+ engine, various operations on longs are emulated instead of
+ being inlined. Also the newarr invocation is turned into a
+ call to the runtime.
+
+ At this point, after computing variable liveness, it is
+ possible to use the linear scan algorithm for allocating
+ variables to registers. The linear scan pass uses the
+ information that was previously gathered by the loop nesting
+ and loop structure computation to favor variables in inner
+ loops. This process updates the basic block `nesting' field
+ which is later used during liveness analysis.
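+
+ The linear scan idea itself fits in a few lines: intervals
+ sorted by start position take a register while live, and
+ intervals that have ended return theirs to the pool. The
+ sketch below omits spilling heuristics and the loop-nesting
+ weights described above:
+
+    #include <stdio.h>
+
+    #define NREGS 2
+
+    typedef struct { int start, end, reg; } Interval;
+
+    int
+    main (void)
+    {
+            /* live intervals, already sorted by increasing start */
+            Interval vars [] = { {0, 4, -1}, {1, 3, -1}, {2, 6, -1}, {5, 7, -1} };
+            int n = sizeof (vars) / sizeof (vars [0]);
+            int free_regs [NREGS] = { 1, 1 };
+            int i, j, r;
+
+            for (i = 0; i < n; ++i) {
+                    /* expire intervals that end before this one starts */
+                    for (j = 0; j < i; ++j)
+                            if (vars [j].reg >= 0 && vars [j].end < vars [i].start) {
+                                    free_regs [vars [j].reg] = 1;
+                                    vars [j].reg = -2;    /* retired */
+                            }
+                    for (r = 0; r < NREGS; ++r)
+                            if (free_regs [r]) {
+                                    free_regs [r] = 0;
+                                    vars [i].reg = r;
+                                    break;
+                            }
+                    if (vars [i].reg >= 0)
+                            printf ("[%d,%d] -> reg %d\n", vars [i].start,
+                                    vars [i].end, vars [i].reg);
+                    else
+                            printf ("[%d,%d] -> spilled\n", vars [i].start,
+                                    vars [i].end);
+            }
+            return 0;
+    }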
+
+ Stack space is then reserved for the local variables and any
+ temporary variables generated during the various
+ optimizations.
+
+** Instruction selection
+
+ At this point, the BURS instruction selector is invoked to
+ transform the tree-based representation into a list of
+ instructions. This is done using a tree pattern matcher that
+ is generated for the architecture using the `monoburg' tool.
+
+ Monoburg takes as input a file that describes tree patterns,
+ which are matched against the trees that were produced by the
+ engine in the previous stages.
+
+ The pattern matcher might find more than one match for a
+ particular tree. In this case, the match selected is the one
+ with the smallest cost. A cost can be attached to each rule;
+ if no cost is provided, the implicit cost is one.
+
+ The cost function can be used to select particular blocks of
+ code for a given architecture or, by using a prohibitively
+ high number, to prevent a rule from matching.
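+
+ The essence of this cost-driven selection can be sketched in
+ a few lines: every rule that matches a node is priced as its
+ own cost plus the cost of its children, and the cheapest
+ match wins. The two rules and their costs below are invented
+ for the example; they let a multiplication by 2 be selected
+ as a cheaper shift:
+
+    #include <stdio.h>
+
+    enum { OP_CONST, OP_REG, OP_MUL };
+
+    typedef struct _Node {
+            int op, value;
+            struct _Node *left, *right;
+            int cost;                  /* minimal cost to yield a "reg" */
+            const char *rule;          /* the winning rule */
+    } Node;
+
+    static void
+    label (Node *n)
+    {
+            if (!n)
+                    return;
+            label (n->left);
+            label (n->right);
+            switch (n->op) {
+            case OP_CONST: n->cost = 1; n->rule = "load imm"; break;
+            case OP_REG:   n->cost = 0; n->rule = "register"; break;
+            case OP_MUL:
+                    /* generic rule: reg: MUL (reg, reg), cost 3 */
+                    n->cost = 3 + n->left->cost + n->right->cost;
+                    n->rule = "mul";
+                    /* special rule: reg: MUL (reg, CONST[2]), cost 1 */
+                    if (n->right->op == OP_CONST && n->right->value == 2 &&
+                        1 + n->left->cost < n->cost) {
+                            n->cost = 1 + n->left->cost;
+                            n->rule = "shl 1";
+                    }
+                    break;
+            }
+    }
+
+    int
+    main (void)
+    {
+            Node a = { OP_REG }, two = { OP_CONST, 2 };
+            Node mul = { OP_MUL, 0, &a, &two };
+
+            label (&mul);
+            printf ("selected '%s', cost %d\n", mul.rule, mul.cost);
+            return 0;
+    }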
+
+ The various rules that our JIT engine uses transform a tree
+ of MonoInsts into a list of MonoInsts:
+
+ +-----------------------------------------------------------+
+ | Tree List |
+ | of ===> Instruction selection ===> of |
+ | MonoInst MonoInst. |
+ +-----------------------------------------------------------+
+
+ During this process various kinds of MonoInsts disappear and
+ are turned into lower-level representations. The JIT compiler
+ just happens to reuse the same structure (this is done to
+ reduce memory usage and improve memory locality).
+
+ The instruction selection rules are split across a number of
+ files, each with a particular purpose:
+
+ inssel.brg
+ Contains the generic instruction selection
+ patterns.
+
+ inssel-x86.brg
+ Contains x86 specific rules.
+
+ inssel-ppc.brg
+ Contains PowerPC specific rules.
+
+ inssel-long32.brg
+ Burg file for 64-bit instructions on 32-bit architectures.
+
+ inssel-long.brg
+ Burg file for 64-bit architectures.
+
+ inssel-float.brg
+ Burg file for floating point instructions.
+
+ For a given build, a set of those files would be included.
+ For example, for the build of Mono on the x86, the following
+ set is used:
+
+ inssel.brg inssel-x86.brg inssel-long32.brg inssel-float.brg
+
+** Native method generation
+
+ The native method generation has a number of steps:
+
+ * Architecture specific register allocation.
+
+ The information about loop nesting that was
+ previously gathered is used here to hint the
+ register allocator.
+
+ * Generating the method prolog/epilog.
+
+ * Optionally generate code to introduce tracing facilities.
+
+ * Hooking into the debugger.
+
+ * Performing any pending fixups.
+
+ * Code generation.
+
+*** Code Generation
+
+ The actual code generation is contained in the architecture
+ specific portion of the compiler. The input to the code
+ generator is each one of the basic blocks with its list of
+ instructions that were produced in the instruction selection
+ phase.
+
+ During the instruction selection phase, virtual registers are
+ assigned. Just before the peephole optimization is performed,
+ physical registers are assigned.
+
+ A simple peephole and algebraic optimizer is run at this
+ stage.
+
+ The peephole optimizer removes some redundant operations.
+ This is possible because the code generator now has
+ visibility into the whole basic block that spans the
+ original trees.
+
+ The algebraic optimizer performs some simple algebraic
+ optimizations that replace expensive operations with cheaper
+ operations if possible.
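+
+ A small sketch of the kind of rewriting performed here: one
+ pass over an instruction list that drops a useless move
+ (peephole) and turns a multiplication by a power of two into
+ a shift (algebraic). The instruction encoding is invented for
+ the example:
+
+    #include <stdio.h>
+
+    enum { OP_MOVE, OP_MUL_IMM, OP_SHL_IMM };
+
+    typedef struct { int op, dreg, sreg, imm, dead; } Ins;
+
+    int
+    main (void)
+    {
+            Ins code [] = {
+                    { OP_MOVE, 1, 1 },          /* mov r1, r1: a no-op */
+                    { OP_MUL_IMM, 2, 1, 8 },    /* r2 = r1 * 8         */
+            };
+            int i, n = sizeof (code) / sizeof (code [0]);
+
+            for (i = 0; i < n; ++i) {
+                    /* peephole: drop moves from a register to itself */
+                    if (code [i].op == OP_MOVE && code [i].dreg == code [i].sreg)
+                            code [i].dead = 1;
+                    /* algebraic: multiply by 2^n becomes a shift by n */
+                    if (code [i].op == OP_MUL_IMM && code [i].imm > 0 &&
+                        (code [i].imm & (code [i].imm - 1)) == 0) {
+                            int shift = 0;
+                            while ((1 << shift) < code [i].imm)
+                                    ++shift;
+                            code [i].op = OP_SHL_IMM;
+                            code [i].imm = shift;
+                    }
+            }
+            for (i = 0; i < n; ++i)
+                    if (!code [i].dead)
+                            printf ("op=%d dreg=%d sreg=%d imm=%d\n", code [i].op,
+                                    code [i].dreg, code [i].sreg, code [i].imm);
+            return 0;
+    }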
+
+ The rest of the code generation is fairly simple: a switch
+ statement is used to generate code for each of the MonoInsts.
+ This lives in the mono/mini/mini-ARCH.c files, in a method
+ called "mono_arch_output_basic_block".
+
+ We always try to allocate code in sequence, instead of just using
+ malloc. This way we increase spatial locality, which gives a
+ massive speedup on most architectures.
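+
+ The idea can be sketched as a bump allocator over one large
+ executable region, so that consecutively compiled methods
+ land next to each other in memory. This is only an
+ illustration, not mono's actual code manager:
+
+    #include <stdio.h>
+    #include <sys/mman.h>
+
+    static unsigned char *region;
+    static unsigned long pos;
+    static unsigned long size = 1 << 20;
+
+    static void *
+    code_alloc (unsigned long len)
+    {
+            void *p;
+
+            if (!region)
+                    region = mmap (NULL, size, PROT_READ | PROT_WRITE | PROT_EXEC,
+                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+            if (pos + len > size)
+                    return NULL;    /* a real manager would chain regions */
+            p = region + pos;
+            pos += (len + 15) & ~15UL;   /* keep methods 16-byte aligned */
+            return p;
+    }
+
+    int
+    main (void)
+    {
+            /* two "methods" end up adjacent in the same region */
+            printf ("%p\n%p\n", code_alloc (100), code_alloc (100));
+            return 0;
+    }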
+
+*** Ahead of Time compilation
+
+ Ahead-of-Time compilation is a new feature of our new
+ compilation engine. The compilation engine is shared by the
+ Just-in-Time (JIT) compiler and the Ahead-of-Time (AOT)
+ compiler.
+
+ The difference lies in the set of optimizations that are
+ turned on for each mode: Just-in-Time compilation should be
+ as fast as possible, while Ahead-of-Time compilation can take
+ as long as required, because it is not done at a
+ time-critical point.
+
+ With AOT compilation, we can afford to turn all of the
+ computationally expensive optimizations on.
+
+ After the code generation phase is done, the code and any
+ required fixup information is saved into a file that is
+ readable by "as" (the native assembler available on all
+ systems). This assembly file is then passed to the native
+ assembler, which generates a loadable module.
+
+ At execution time, when an assembly is loaded from the disk,
+ the runtime engine will probe for the existence of a
+ pre-compiled image. If the pre-compiled image exists, then it
+ is loaded, and the method invocations are resolved to the code
+ contained in the loaded module.
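+
+ Assuming an assembly named program.exe (a placeholder), the
+ workflow would look roughly as follows; the exact naming of
+ the generated module is platform specific:
+
+    mono --aot program.exe    # pre-compile the assembly
+    mono program.exe          # later runs pick up the pre-compiled code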
+
+ The code generated under the AOT scenario is slightly
+ different from the JIT scenario: the generated code is
+ application-domain relative and can be shared among
+ multiple threads.
+
+ This is the same code generation that is used when the runtime
+ is instructed to maximize code sharing on a multi-application
+ domain scenario.
+
+* SSA-based optimizations
+
+ SSA form simplifies many optimizations because each variable
+ has exactly one definition site; in other words, each
+ variable is assigned exactly once.
+
+ For example, code like this:
+
+ a = 1
+ ..
+ a = 2
+ call (a)
+
+ Is internally turned into:
+
+ a1 = 1
+ ..
+ a2 = 2
+ call (a2)
+
+ In the presence of branches, like:
+
+ if (x)
+ a = 1
+ else
+ a = 2
+
+ call (a)
+
+ The code is turned into:
+
+ if (x)
+ a1 = 1;
+ else
+ a2 = 2;
+ a3 = phi (a1, a2)
+ call (a3)
+
+ All uses of a variable are "dominated" by its definition.
+
+ This representation is useful as it simplifies the
+ implementation of a number of optimizations like conditional
+ constant propagation, array bounds check removal and dead code
+ elimination.
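+
+ Dead code elimination, for example, reduces to counting uses
+ in this form: because every variable has exactly one
+ definition, a definition with no remaining uses (and no side
+ effects) can simply be dropped. A self-contained sketch, with
+ an invented instruction encoding:
+
+    #include <stdio.h>
+
+    /* def/use slots name SSA variables by number; -1 means "none" */
+    typedef struct { int def, use1; const char *text; } Ins;
+
+    int
+    main (void)
+    {
+            Ins code [] = {
+                    {  1, -1, "a1 = 1"    },
+                    {  2, -1, "a2 = 2"    },
+                    { -1,  2, "call (a2)" },
+            };
+            int uses [8] = { 0 };
+            int i, n = sizeof (code) / sizeof (code [0]);
+
+            for (i = 0; i < n; ++i)        /* count the uses */
+                    if (code [i].use1 >= 0)
+                            uses [code [i].use1]++;
+            for (i = 0; i < n; ++i)        /* drop unused definitions */
+                    if (code [i].def < 0 || uses [code [i].def] > 0)
+                            printf ("%s\n", code [i].text);
+            return 0;
+    }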
+
+* Register allocation.
+
+ Global register allocation is performed on the medium
+ intermediate representation just before instruction selection
+ is performed on the method. Local register allocation is
+ later performed at the basic-block level.
+
+ Global register allocation uses the following input:
+
+ 1) the set of register-sized variables that can be allocated
+ to a register (this is an architecture-specific setting; for
+ x86 these are the callee-saved registers ESI, EDI and
+ EBX).
+
+ 2) liveness information for the variables
+
+ 3) (optionally) loop info to favor variables that are used in
+ inner loops.
+
+ During the instruction selection phase, symbolic registers
+ are assigned to temporary values in expressions.
+
+ Local register allocation assigns hard registers to the
+ symbolic registers; it is performed just before the code is
+ actually emitted, and works at the basic block level.
+ A CPU description file describes the input registers, output
+ registers, fixed registers and the registers clobbered by
+ each operation.
+
+* BURG Code Generator Generator
+
+ monoburg was written by Dietmar Maurer. It is based on the
+ papers from Christopher W. Fraser, Robert R. Henry and Todd
+ A. Proebsting: "BURG - Fast Optimal Instruction Selection and
+ Tree Parsing" and "Engineering a Simple, Efficient Code
+ Generator Generator".
+
+ The original BURG implementation is unable to work on DAGs;
+ only trees are allowed. Our monoburg implementation is able
+ to generate a tree matcher which works on DAGs, and we use
+ this feature in the new JIT. This simplifies the code,
+ because we can directly pass DAGs and do not need to convert
+ them to trees.
+
+* Adding IL opcodes: an exercise (from a post by Paolo Molaro)
+
+ mini.c is the file that reads the IL code stream and decides
+ how each IL instruction is implemented
+ (the mono_method_to_ir () function), so you always have to
+ add an entry to the big switch inside that function: there
+ are plenty of examples in the file.
+
+ An IL opcode can be implemented in a number of ways, depending
+ on what it does and how it needs to do it.
+
+ Some opcodes are implemented using a helper function: one of
+ the simpler examples is the CEE_STELEM_REF implementation.
+
+ In this case the opcode implementation is written in a C
+ function. You will need to register the function with the
+ JIT before you can use it (mono_register_jit_call), and you
+ need to emit the call to the helper using the
+ mono_emit_jit_icall() function.
+
+ This is the simplest way to add a new opcode, and it doesn't
+ require any arch-specific changes (though it's limited to
+ what you can do in C code, and performance may be limited by
+ the function call).
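+
+ The registration pattern can be sketched with a
+ self-contained stand-in: a small table mapping names to C
+ helpers. The real mono_register_jit_call and
+ mono_emit_jit_icall APIs differ from this; the sketch only
+ shows the shape of the mechanism:
+
+    #include <stdio.h>
+    #include <string.h>
+
+    typedef void (*JitICall) (void);
+
+    static struct { const char *name; JitICall func; } icalls [32];
+    static int n_icalls;
+
+    /* stand-in for registering a helper with the JIT at startup */
+    static void
+    register_jit_call (const char *name, JitICall func)
+    {
+            icalls [n_icalls].name = name;
+            icalls [n_icalls++].func = func;
+    }
+
+    static JitICall
+    lookup_jit_call (const char *name)
+    {
+            int i;
+
+            for (i = 0; i < n_icalls; ++i)
+                    if (!strcmp (icalls [i].name, name))
+                            return icalls [i].func;
+            return NULL;
+    }
+
+    /* the helper itself: the opcode's work is done in plain C */
+    static void
+    helper_stelem_ref (void)
+    {
+            printf ("stelem.ref helper invoked\n");
+    }
+
+    int
+    main (void)
+    {
+            register_jit_call ("helper_stelem_ref", helper_stelem_ref);
+            /* the generated code would contain a call like this one */
+            lookup_jit_call ("helper_stelem_ref") ();
+            return 0;
+    }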
+
+ Other opcodes can be implemented with one or more of the already
+ implemented low-level instructions.
+
+ An example is the OP_STRLEN opcode, which implements
+ String.Length using a simple load from memory. In this case
+ you need to add a rule to the appropriate burg file,
+ describing the arguments of the opcode and, if any, its
+ 'return' value.
+
+ The OP_STRLEN case is:
+
+ reg: OP_STRLEN (reg) {
+ MONO_EMIT_LOAD_MEMBASE_OP (s, tree, OP_LOADI4_MEMBASE, state->reg1,
+ state->left->reg1, G_STRUCT_OFFSET (MonoString, length));
+ }
+
+ The above means: OP_STRLEN takes a register as an argument
+ and returns its value in a register. The implementation
+ is included in the braces.
+
+ The opcode returns a value in an integer register
+ (state->reg1) by performing an int32 load of the length field
+ of the MonoString represented by the input register
+ (state->left->reg1): before the burg rules are applied, the
+ internal representation is tree-based, so you get the
+ left/right pointers (state->left and state->right,
+ respectively; the result is stored in state->reg1).
+
+ This instruction implementation doesn't require arch-specific
+ changes (it uses the MONO_EMIT_LOAD_MEMBASE_OP macro, which
+ is available on all platforms), and usually the produced
+ code is fast.
+
+ Next we have opcodes that must be implemented with new
+ low-level architecture-specific instructions (either because
+ of performance considerations or because the functionality
+ can't be implemented in other ways).
+
+ You also need a burg rule in this case. For example, consider
+ the OP_CHECK_THIS opcode (used to raise an exception if the
+ this pointer is null). The burg rule simply reads:
+
+ stmt: OP_CHECK_THIS (reg) {
+ mono_bblock_add_inst (s->cbb, tree);
+ }
+
+ Note that this opcode does not return a value (hence the
+ "stmt") and it takes a register as input.
+
+ mono_bblock_add_inst (s->cbb, tree) just adds the instruction
+ (the tree variable) to the current basic block (s->cbb). In
+ mini this is the place where the internal representation
+ switches from the tree format to the low-level format (the
+ list of simple instructions).
+
+ In this case the actual opcode implementation is delegated to
+ the arch-specific code. A low-level opcode needs an entry in
+ the machine description (the *.md files in mini/). This entry
+ describes what kind of registers, if any, are used by the
+ instruction, as well as other details such as constraints or
+ other architecture-specific hints to the low-level engine.
+
+ cpu-pentium.md, for example, has the following entry:
+
+ checkthis: src1:b len:3
+
+ This means the instruction uses an integer register as a base
+ pointer (basically a load or store is done on it) and it takes
+ 3 bytes of native code to implement it.
+
+ Now you just need to provide the low-level implementation for
+ the opcode in one of the mini-$arch.c files, in the
+ mono_arch_output_basic_block() function. There is a big switch
+ here too. The x86 implementation is:
+
+ case OP_CHECK_THIS:
+ /* ensure ins->sreg1 is not NULL */
+ x86_alu_membase_imm (code, X86_CMP, ins->sreg1, 0, 0);
+ break;
+
+ If the $arch-codegen.h header file doesn't have the code to
+ emit the low-level native code, you'll need to write that as
+ well.
+
+ Complex opcodes with register constraints may require other
+ changes to the local register allocator, but usually such
+ changes are not needed.
+
+* Future
+
+ Profile-based optimization is something that we are very
+ interested in supporting. There are two possible usage
+ scenarios:
+
+ * Based on the profile information gathered during
+ the execution of a program, hot methods can be compiled
+ with the highest level of optimizations, while bootstrap
+ code and cold methods can be compiled with the least set
+ of optimizations and placed in a discardable list.
+
+ * Code reordering: this profile-based optimization would
+ only make sense for pre-compiled code. The profile
+ information is used to re-order the assembly code on disk
+ so that the code is placed on the disk in a way that
+ increases locality.
+
+ This is the same principle under which SGI's cord program
+ works.
+
+ The nature of the CIL makes the above optimizations easy to
+ implement and deploy. Since we live in and define our own
+ universe for these things, no interactions with system tools
+ are required, nor are upgrades to the underlying
+ infrastructure.
+
+ Instruction scheduling is important for certain kinds of
+ processors, and some of the framework to cope with this
+ exists today in our register allocator and instruction
+ selector, but it has not been finished. Instruction
+ scheduling would happen at the same time as local register
+ allocation.
\ No newline at end of file