A new JIT compiler for the Mono Project

Miguel de Icaza (miguel@{ximian.com,gnome.org}),
Paolo Molaro (lupus@{ximian.com,debian.org})


* Abstract

Mini is a new compilation engine for the Mono runtime. The new engine is designed to bring new code generation optimizations, portability and pre-compilation.

In this document we describe the design decisions and the architecture of the new compilation engine.

* Introduction

Mono is an Open Source implementation of the .NET Framework: it is made up of a runtime engine that implements the ECMA Common Language Infrastructure (CLI), a set of compilers that target the CLI, and a large collection of class libraries.

This article discusses the new code generation facilities that have been added to the Mono runtime.

First we discuss the overall architecture of the Mono runtime and how code generation fits into it; then we discuss the development and basic architecture of our first JIT compiler for the ECMA CIL framework. The next section covers the objectives of the work on the new JIT compiler; we then discuss the new features available in the new JIT compiler, and finally give a technical description of the new code generation engine.

* Architecture of the Mono Runtime

The Mono runtime is an implementation of the ECMA Common Language Infrastructure (CLI), whose aim is to be a common platform for executing code in multiple languages.

Languages that target the CLI generate images that contain code in a high-level intermediate representation called the "Common Intermediate Language" (CIL). This intermediate language is rich enough to allow programs and pre-compiled libraries to be reflected. The execution model is object oriented, with single inheritance and multiple interface implementations.

The runtime provides a number of services to the programs that target it: Just-in-Time compilation of CIL code into native code, garbage collection, thread management, I/O routines, single, double and decimal floating point, asynchronous method invocation, application domains, a framework for building arbitrary RPC systems (remoting), and integration with system libraries through the Platform Invoke functionality.

The focus of this document is on the services provided by the Mono runtime to transform CIL bytecodes into code that is native to the underlying architecture.

The code generation interface is a set of macros that allow a C programmer to generate code on the fly; these macros live in the mono/jit/arch/ directory and are used by the JIT compiler to generate native code.
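As a minimal sketch of the flavor of these macros (macro names taken from the x86 code generation header; the header location and the buffer management are simplified assumptions here), the following emits a tiny native sequence into a buffer:

        #include "x86-codegen.h"        /* x86 code generation macros */

        /* Emit "mov $42, %eax; ret" into a buffer.  Each macro writes
           the instruction bytes and advances the code pointer. */
        unsigned char *
        emit_return_42 (unsigned char *code)
        {
                x86_mov_reg_imm (code, X86_EAX, 42);
                x86_ret (code);
                return code;    /* one past the last byte emitted */
        }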
The platform invocation code is interesting: it generates CIL code on the fly to marshal parameters, and this code is in turn processed by the JIT engine.

* Previous Experiences

Mono has built a JIT engine, which has been used to bootstrap Mono since January 2002. This JIT engine has reasonable performance, and uses a tree-pattern-matching instruction selector based on the BURS technology. It was designed by Dietmar Maurer, Paolo Molaro and Miguel de Icaza.

The existing JIT compiler has three phases:

  * Re-creation of the semantic tree from CIL byte-codes.

  * Instruction selection, with a cost-driven engine.

  * Code generation and register allocation.

It is also hooked into the rest of the runtime to provide services like marshaling, just-in-time compilation and invocation of "internal calls".

This engine constructed a collection of trees, which we referred to as the "forest of trees"; this forest is created by "hydrating" the CIL instruction stream.

The first step was to identify the basic blocks of the method and compute its control flow graph (cfg). Once this information was computed, a stack analysis was performed on each basic block to create a forest of trees for each of them.

So, for example, the following statement:

        int a, b;
        ...
        b = a + 1;

is represented in CIL as:

        ldloc.0
        ldc.i4.1
        add
        stloc.1

After the stack analysis, the engine creates the following tree:

        (STIND_I4 ADDR_L[EBX|2] (
                ADD (LDIND_I4 ADDR_L[ESI|1])
                    CONST_I4[1]))

This tree contains information from the stack analysis: for instance, notice that the operations explicitly encode the data types they are operating on. There is no longer any ambiguity about the types, because this information has been inferred.
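The "hydration" pass is essentially an abstract interpretation of the CIL evaluation stack, where the stack holds partially built trees instead of values. A minimal sketch of the idea, using simplified stand-ins rather than the real MonoInst definitions:

        /* Simplified stand-ins for the real runtime structures. */
        typedef struct Inst {
                int opcode;                  /* LDIND_I4, ADD, STIND_I4, ... */
                struct Inst *left, *right;   /* operand subtrees */
        } Inst;

        static Inst *new_inst (int opcode); /* hypothetical node allocator */

        static Inst *stack [64];            /* the abstract evaluation stack */
        static int sp;

        /* A binary CIL opcode such as "add" pops its two operands and
           pushes the combined tree, so "ldloc.0; ldc.i4.1; add" hydrates
           into a single ADD node with two leaves. */
        static void
        hydrate_binop (int opcode)
        {
                Inst *ins = new_inst (opcode);
                ins->right = stack [--sp];
                ins->left = stack [--sp];
                stack [sp++] = ins;
        }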
At this point the JIT would pass the constructed forest of trees to the architecture-dependent JIT compiler.

The architecture-dependent code then performed register allocation (optionally using linear scan allocation for variables, based on liveness analysis).

Once variables had been assigned, tree pattern matching with dynamic programming was used (the tree pattern matcher is custom built for each architecture, using a code generator: monoburg). The instruction selector used cost functions to select the best instruction patterns.

The instruction selector is able to produce instructions that take advantage of the x86 indexed addressing modes, for example.

One problem, though, was that the code emitter and the register allocator did not have any visibility outside the current tree, which meant that some redundant instructions were generated. A peephole optimizer was hard to write with this architecture, given the tree-based representation that was used.

This JIT was functional, but it did not provide a good architecture on which to base future optimizations. Also, the line between architecture-neutral and architecture-specific code and optimizations was hard to draw.

The JIT engine supported two code generation modes to serve applications that host multiple application domains: generate code that will be shared across application domains, or generate code that will not be shared across application domains.

* Objectives of the new JIT engine.

We wanted to support a number of features that were missing:

  * Ahead-of-time compilation.

    The idea is to allow developers to pre-compile their code to native code, to reduce startup time and the working set used at runtime by the just-in-time compiler.

    Although in Mono this has not been a visible problem, we wanted to address it proactively.

    When an assembly (a Mono/.NET executable) is installed in the system, it would then be possible to pre-compile the code, and have the JIT compiler tune the generated code to the particular CPU on which the software is installed.

    This is done in the Microsoft .NET world with a tool called ngen.exe.

  * Have a good platform for doing code optimizations.

    The design called for a good architecture that would enable various levels of optimizations: some optimizations are better performed on high-level intermediate representations, some on medium-level, and some on low-level representations.

    It should also be possible to turn these on or off conditionally. Some optimizations are too expensive to be used in just-in-time compilation scenarios, but these expensive optimizations can be turned on for ahead-of-time compilation, or when using profile-guided optimizations on a subset of the executed methods.

  * Reduce the effort required to port the Mono code generator to new architectures.

    For Mono to gain wide adoption in the Unix world, it is necessary that the JIT engine works on most of today's commercial hardware platforms.

* Features of the new JIT engine.

The new JIT engine was architected by Dietmar Maurer and Paolo Molaro, based on the new objectives.

Mono provides a number of services to applications running with the new JIT compiler:

  * Just-in-Time compilation of CLI code into native code.

  * Ahead-of-Time compilation of CLI code, to reduce startup time of applications.

A number of software development features are also available:

  * Execution time profiling (--profile)

    Generates a report of the time consumed by each routine, the number of invocations, and the callers.

  * Memory usage profiling (--profile)

    Generates a report of the memory usage of a program that is run under the Mono JIT.

  * Code coverage (--coverage)

  * Execution tracing.

People who are interested in developing and improving the Mini JIT compiler will also find a few useful facilities:

  * Compilation timing.

    This is used to measure the time the JIT takes to compile a routine.

  * Control Flow Graph and Dominator Tree drawing.

    These are visual aids for the JIT developer: they render representations of the control flow graph and, for the more advanced optimizations, the dominator tree graph.

    This requires dot (from the graphviz package) and Ghostview.

  * Code generator regression tests.

    The engine contains support for running regression tests on the virtual machine, which is very helpful to developers interested in improving the engine.

  * Optimization benchmark framework.

    The JIT engine will generate graphs that compare various benchmarks embedded in an assembly, and run the various tests with different optimization flags.

    This requires Perl and GD::Graph.
* Flexibility

This is probably the most important component of the new code generation engine. The internals are relatively easy to replace and update; even large passes can be replaced and implemented differently.

* New code generator

Compiling a method begins with the `mini_method_to_ir' routine, which converts the CIL representation into a medium-level intermediate representation.

The mini_method_to_ir routine performs a number of operations:

  * Flow analysis and control flow graph computation.

    Unlike the previous version, stack analysis and control flow graphs are computed in a single pass in the mini_method_to_ir function. This is done for performance reasons: although the complexity increases, the benefit for a JIT compiler is that there is more time available for performing other optimizations.

  * Basic block computation.

    mini_method_to_ir populates the MonoCompile structure with an array of basic blocks, each of which contains a forest of trees made up of MonoInst structures.

  * Inlining.

    Inlining is no longer restricted to methods containing a single basic block; instead it is possible to inline arbitrarily complex methods.

    The heuristics that choose what to inline are likely to be tuned in the future.

  * Method-to-opcode conversion.

    Some method invocations, like `call Math.Sin', are transformed into an opcode: this turns the call into a semantically rich node, which is later inlined into an FPU instruction.

    Various Array method invocations are turned into opcodes as well (the Get, Set and Address methods).

  * Tail recursion elimination.

The MonoInst structure holds the actual decoded instruction, with the semantic information from the stack analysis. MonoInst is interesting because initially it is part of a tree structure. Here is a sample of the same tree as before, in the new JIT engine:

        (stind.i4 regoffset[0xffffffd4(%ebp)]
                (add (ldind.i4 regoffset[0xffffffd8(%ebp)])
                     iconst[1]))

This is a medium-level intermediate representation (MIR).

Some complex opcodes are decomposed at this stage into a collection of simpler opcodes. Not every complex opcode is decomposed at this stage, as we need to preserve the semantic information during the various optimization phases.

For example, a NEWARR opcode carries the length and the type of the array, which can be used later to avoid type checks or array bounds checks.

There are a number of operations supported on this representation:

  * Branch optimizations.

  * Variable liveness.

  * Loop optimizations: the dominator trees are computed, loops are detected, and their nesting level computed.

  * Conversion of the method into static single assignment form (SSA form).

  * Dead code elimination.

  * Constant propagation.

  * Copy propagation.

  * Constant folding.

Once the above optimizations are optionally performed, a decomposition phase is used to turn some complex opcodes into internal method calls. In the initial version of the JIT engine, various operations on longs are emulated instead of being inlined. Also, the newarr invocation is turned into a call to the runtime.

At this point, after computing variable liveness, it is possible to use the linear scan algorithm for allocating variables to registers. The linear scan pass uses the information that was previously gathered by the loop nesting and loop structure computation to favor variables in inner loops. This process updates the basic block `nesting' field, which is later used during liveness analysis.

Stack space is then reserved for the local variables and any temporary variables generated during the various optimizations.

** Instruction selection

At this point, the BURS instruction selector is invoked to transform the tree-based representation into a list of instructions. This is done using a tree pattern matcher that is generated for the architecture using the `monoburg' tool.

Monoburg takes as input a file that describes tree patterns, which are matched against the trees that were produced by the engine in the previous stages.

The pattern matching might produce more than one match for a particular tree. In this case, the match selected is the one with the smallest cost. A cost can be attached to each rule; if no cost is provided, the implicit cost is one.

The cost function can be used to select particular blocks of code for a given architecture, or a prohibitively high cost can be used to keep a rule from matching.
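For illustration, a hypothetical rule in the style of the real rules quoted later in this document (the opcode, the cost placement and the emitter helper are all assumptions here; the inssel-*.brg files are the authoritative reference for the grammar):

        reg: OP_MUL (reg, OP_ICONST) 2 {
                /* hypothetical action: emit a cheaper instruction when
                   the constant operand allows it; the cost of 2 lets a
                   cheaper competing rule win when both match a tree */
                emit_mul_imm (s, tree, state);
        }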
The various rules that our JIT engine uses transform a tree of MonoInsts into a list of MonoInsts:

        +-----------------------------------------------------------+
        | Tree                                           List       |
        |  of      ===> Instruction selection ===>        of        |
        | MonoInst                                      MonoInst.   |
        +-----------------------------------------------------------+

During this process, various kinds of MonoInst disappear and are turned into lower-level representations. The JIT compiler just happens to reuse the same structure (this is done to reduce memory usage and improve memory locality).

The instruction selection rules are split into a number of files, each with a particular purpose:

        inssel.brg
                Contains the generic instruction selection patterns.

        inssel-x86.brg
                Contains x86-specific rules.

        inssel-ppc.brg
                Contains PowerPC-specific rules.

        inssel-long32.brg
                Burg file for 64-bit instructions on 32-bit architectures.

        inssel-long.brg
                Burg file for 64-bit architectures.

        inssel-float.brg
                Burg file for floating point instructions.

For a given build, a set of those files is included. For example, the build of Mono on the x86 uses the following set:

        inssel.brg inssel-x86.brg inssel-long32.brg inssel-float.brg

** Native method generation

The native method generation has a number of steps:

  * Architecture-specific register allocation.

    The information about loop nesting that was previously gathered is used here to hint the register allocator.

  * Generating the method prolog/epilog.

  * Optionally generating code to introduce tracing facilities.

  * Hooking into the debugger.

  * Performing any pending fixups.

  * Code generation.

*** Code Generation

The actual code generation is contained in the architecture-specific portion of the compiler. The input to the code generator is each of the basic blocks with the list of instructions that were produced in the instruction selection phase.

During the instruction selection phase, virtual registers are assigned. Just before the peephole optimization is performed, physical registers are assigned.

A simple peephole and algebraic optimizer is run at this stage.

The peephole optimizer removes some redundant operations at this point. This is possible because the code generator now has visibility into the whole basic block that spans the original trees.

The algebraic optimizer performs some simple algebraic optimizations that replace expensive operations with cheaper ones where possible.

The rest of the code generation is fairly simple: a switch statement is used to generate code for each of the MonoInsts; it lives in the mono/mini/mini-ARCH.c files, in the method called "mono_arch_output_basic_block".

We always try to allocate code in sequence, instead of just using malloc; this way we increase spatial locality, which gives a massive speedup on most architectures.
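As an illustration of the kind of rewrite the peephole pass performs, here is a minimal sketch over a hypothetical instruction list (not the actual mini data structures):

        /* Hypothetical instruction list node, for illustration only. */
        typedef struct Ins {
                int opcode, dreg, sreg;      /* e.g. a move with dest/src */
                struct Ins *next;
        } Ins;

        /* Drop "mov %reg, %reg" no-ops.  With a whole basic block
           visible as a list, such redundancies are trivial to spot,
           which was hard in the purely tree-based representation. */
        static void
        peephole_remove_self_moves (Ins **list, int op_move)
        {
                while (*list) {
                        Ins *ins = *list;
                        if (ins->opcode == op_move && ins->dreg == ins->sreg)
                                *list = ins->next;      /* unlink the no-op */
                        else
                                list = &ins->next;
                }
        }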
*** Ahead-of-Time compilation

Ahead-of-Time (AOT) compilation is a new feature of our new compilation engine. The compilation engine is shared by the Just-in-Time (JIT) compiler and the Ahead-of-Time compiler.

The difference is in the set of optimizations that are turned on for each mode: Just-in-Time compilation should be as fast as possible, while Ahead-of-Time compilation can take as long as required, because it does not happen at a time-critical point.

With AOT compilation, we can afford to turn on all of the computationally expensive optimizations.

After the code generation phase is done, the code and any required fixup information is saved into a file that is readable by "as" (the native assembler available on all systems). This assembly file is then passed to the native assembler, which generates a loadable module.

At execution time, when an assembly is loaded from disk, the runtime engine probes for the existence of a pre-compiled image. If the pre-compiled image exists, it is loaded, and method invocations are resolved to the code contained in the loaded module.

The code generated under the AOT scenario is slightly different from the JIT scenario: it is application-domain relative and can be shared among multiple threads.

This is the same code generation that is used when the runtime is instructed to maximize code sharing in a multi-application-domain scenario.

* SSA-based optimizations

SSA form simplifies many optimizations because each variable has exactly one definition site; that is, each variable is assigned exactly once.

For example, code like this:

        a = 1
        ..
        a = 2
        call (a)

is internally turned into:

        a1 = 1
        ..
        a2 = 2
        call (a2)

In the presence of branches, like:

        if (x)
            a = 1
        else
            a = 2

        call (a)

the code is turned into:

        if (x)
            a1 = 1;
        else
            a2 = 2;
        a3 = phi (a1, a2)
        call (a3)

All uses of a variable are "dominated" by its definition.

This representation is useful as it simplifies the implementation of a number of optimizations like conditional constant propagation, array bounds check removal and dead code elimination.

* Register allocation.

Global register allocation is performed on the medium intermediate representation just before instruction selection is performed on the method. Local register allocation is performed later, at the basic-block level, on the symbolic registers assigned during instruction selection.

Global register allocation uses the following input:

  1) The set of register-sized variables that can be allocated to a register (this is an architecture-specific setting; for x86 these are the callee-saved registers ESI, EDI and EBX).

  2) Liveness information for the variables.

  3) (Optionally) loop information, to favor variables that are used in inner loops.

During the instruction selection phase, symbolic registers are assigned to temporary values in expressions.

Local register allocation assigns hard registers to the symbolic registers; it is performed just before the code is actually emitted, and works at the basic block level. A CPU description file describes the input registers, output registers, fixed registers and the registers clobbered by each operation.
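A minimal sketch of the linear scan idea mentioned above (live intervals sorted by ascending start; this simplified version spills the interval that ends last and assumes at most 16 registers; it stands in for neither the actual mini implementation nor its loop-nesting heuristics):

        /* One live interval per variable. */
        typedef struct {
                int start, end;    /* live range, in instruction indices */
                int reg;           /* assigned register, or -1 if spilled */
        } Interval;

        static void
        linear_scan (Interval *iv, int n, int nregs)
        {
                int active [16], nactive = 0, i, j, k;

                for (i = 0; i < n; i++) {
                        /* Expire intervals that end before this one starts. */
                        for (j = 0; j < nactive; )
                                if (iv [active [j]].end < iv [i].start)
                                        active [j] = active [--nactive];
                                else
                                        j++;

                        if (nactive < nregs) {
                                /* Pick a register not used by the active set. */
                                int used = 0;
                                for (j = 0; j < nactive; j++)
                                        used |= 1 << iv [active [j]].reg;
                                for (k = 0; used & (1 << k); k++)
                                        ;
                                iv [i].reg = k;
                                active [nactive++] = i;
                        } else {
                                /* All registers busy: spill the interval
                                   that ends last (classic heuristic). */
                                int victim = i;
                                for (j = 0; j < nactive; j++)
                                        if (iv [active [j]].end > iv [victim].end)
                                                victim = active [j];
                                if (victim != i) {
                                        iv [i].reg = iv [victim].reg;
                                        iv [victim].reg = -1;   /* spilled */
                                        for (j = 0; j < nactive; j++)
                                                if (active [j] == victim)
                                                        active [j] = i;
                                } else
                                        iv [i].reg = -1;        /* spill self */
                        }
                }
        }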
* BURG Code Generator Generator

Monoburg was written by Dietmar Maurer. It is based on the papers by Christopher W. Fraser, Robert R. Henry and Todd A. Proebsting: "BURG - Fast Optimal Instruction Selection and Tree Parsing" and "Engineering a Simple, Efficient Code Generator Generator".

The original BURG implementation is unable to work on DAGs; only trees are allowed. Our monoburg implementation is able to generate a tree matcher that works on DAGs, and we use this feature in the new JIT. This simplifies the code, because we can pass DAGs directly and don't need to convert them to trees.

* Adding IL opcodes: an exercise (from a post by Paolo Molaro)

mini.c is the file that reads the IL code stream and decides how each single IL instruction is implemented (the mono_method_to_ir () function), so you always have to add an entry to the big switch inside that function: there are plenty of examples in the file.

An IL opcode can be implemented in a number of ways, depending on what it does and how it needs to do it.

Some opcodes are implemented using a helper function: one of the simpler examples is the CEE_STELEM_REF implementation.

In this case the opcode implementation is written in a C function. You will need to register the function with the JIT before you can use it (mono_register_jit_call), and you need to emit the call to the helper using the mono_emit_jit_icall() function.

This is the simplest way to add a new opcode, and it doesn't require any arch-specific changes (though it's limited to what you can do in C code, and the performance may be limited by the function call).
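Schematically, the helper path looks like the sketch below. Only the helper body is real code; the wiring is shown in comments, because the exact signatures of the registration and emission entry points are version-specific and should be taken from the mini sources:

        /* A C helper implementing some hypothetical new opcode. */
        static int
        helper_count_bits (unsigned int v)
        {
                int n = 0;
                while (v) {
                        n += v & 1;
                        v >>= 1;
                }
                return n;
        }

        /* Wiring it up is conceptually two steps:
           1. at startup, register the helper with the engine:
                  mono_register_jit_call (helper_count_bits, ...);
           2. in the big switch in mono_method_to_ir (), for the new
              opcode, emit a call to the helper:
                  mono_emit_jit_icall (..., helper_count_bits, ...);
           The JIT then compiles the opcode as a plain call. */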
Other opcodes can be implemented with one or more of the already implemented low-level instructions.

An example is the OP_STRLEN opcode, which implements String.Length using a simple load from memory. In this case you need to add a rule to the appropriate burg file, describing what the arguments of the opcode are and what, if any, its 'return' value is.

The OP_STRLEN case is:

        reg: OP_STRLEN (reg) {
                MONO_EMIT_LOAD_MEMBASE_OP (s, tree, OP_LOADI4_MEMBASE, state->reg1,
                        state->left->reg1, G_STRUCT_OFFSET (MonoString, length));
        }

The above means: OP_STRLEN takes a register as an argument and returns its value in a register. The implementation is included in the braces.

The opcode returns a value in an integer register (state->reg1) by performing an int32 load of the length field of the MonoString represented by the input register (state->left->reg1). Before the burg rules are applied, the internal representation is based on trees, so you get the left/right pointers (state->left and state->right respectively; the result is stored in state->reg1).

This instruction implementation doesn't require arch-specific changes (it uses the MONO_EMIT_LOAD_MEMBASE_OP macro, which is available on all platforms), and usually the produced code is fast.

Next we have opcodes that must be implemented with new low-level architecture-specific instructions (either because of performance considerations or because the functionality can't be implemented in other ways).

You need a burg rule in this case, too. For example, consider the OP_CHECK_THIS opcode (used to raise an exception if the this pointer is null). The burg rule simply reads:

        stmt: OP_CHECK_THIS (reg) {
                mono_bblock_add_inst (s->cbb, tree);
        }

Note that this opcode does not return a value (hence the "stmt") and it takes a register as input.

mono_bblock_add_inst (s->cbb, tree) just adds the instruction (the tree variable) to the current basic block (s->cbb). In mini this is the place where the internal representation switches from the tree format to the low-level format (the list of simple instructions).

In this case the actual opcode implementation is delegated to the arch-specific code. A low-level opcode needs an entry in the machine description (the *.md files in mini/). This entry describes what kinds of registers, if any, are used by the instruction, as well as other details such as constraints or other hints to the low-level engine, which are architecture-specific.

cpu-pentium.md, for example, has the following entry:

        checkthis: src1:b len:3

This means the instruction uses an integer register as a base pointer (basically, a load or store is done on it) and takes 3 bytes of native code to implement.

Now you just need to provide the low-level implementation for the opcode in one of the mini-$arch.c files, in the mono_arch_output_basic_block() function. There is a big switch here, too. The x86 implementation is:

        case OP_CHECK_THIS:
                /* ensure ins->sreg1 is not NULL */
                x86_alu_membase_imm (code, X86_CMP, ins->sreg1, 0, 0);
                break;

If the $arch-codegen.h header file doesn't have the code to emit the low-level native code, you'll need to write that as well.

Complex opcodes with register constraints may require other changes to the local register allocator, but usually they are not needed.

* Future

Profile-based optimization is something that we are very interested in supporting. There are two possible usage scenarios:

  * Based on the profile information gathered during the execution of a program, hot methods can be compiled with the highest level of optimizations, while bootstrap code and cold methods can be compiled with the smallest set of optimizations and placed in a discardable list.

  * Code reordering: this profile-based optimization only makes sense for pre-compiled code. The profile information is used to re-order the assembly code on disk so that the code is placed on the disk in a way that improves locality.

    This is the same principle on which SGI's cord program works.

The nature of the CIL makes the above optimizations easy to implement and deploy. Since we live in and define our own universe for these things, no interactions with system tools are required, nor are upgrades of the underlying infrastructure.

Instruction scheduling is important for certain kinds of processors, and some of the framework to cope with it exists today in our register allocator and instruction selector, but it has not been finished. The instruction selection would happen at the same time as local register allocation.