diff options
Diffstat (limited to 'docs/local-regalloc.txt')
-rw-r--r-- | docs/local-regalloc.txt | 208 |
1 files changed, 208 insertions, 0 deletions
diff --git a/docs/local-regalloc.txt b/docs/local-regalloc.txt new file mode 100644 index 00000000000..a6e523557fe --- /dev/null +++ b/docs/local-regalloc.txt @@ -0,0 +1,208 @@ + +* Proposal for the local register allocator + + The local register allocator deals with allocating registers + for temporaries inside a single basic block, while the global + register allocator is concerned with method-wide allocation of + variables. + The global register allocator uses callee-saved register for it's + purpouse so that there is no need to save and restore these registers + at call sites. + + There are a number of issues the local allocator needs to deal with: + *) some instructions expect operands in specific registers (for example + the shl instruction on x86, or the call instruction with thiscall + convention, or the equivalent call instructions on other architectures, + such as the need to put output registers in %oX on sparc) + *) some instructions deliver results only in specific registers (for example + the div instruction on x86, or the call instructionson on almost all + the architectures). + *) it needs to know what registers may be clobbered by an instruction + (such as in a method call) + *) it should avoid excessive reloads or stores to improve performance + + While which specific instructions have limitations is architecture-dependent, + the problem shold be solved in an arch-independent way to reduce code duplication. + The register allocator will be 'driven' by the arch-dependent code, but it's + implementation should be arch-independent. + + To improve the current local register allocator, we need to + keep more state in it than the current setup that only keeps busy/free info. + + Possible state information is: + + free: the resgister is free to use and it doesn't contain useful info + freeable: the register contains data loaded from a local (there is + also info about _which_ local it contains) as a result from previous + instructions (like, there was a store from the register to the local) + moveable: it contains live data that is needed in a following instruction, but + the contents may be moved to a different register + busy: the register contains live data and it is placed there because + the following instructions need it exactly in that register + allocated: the register is used by the global allocator + + The local register allocator will have the following interfaces: + + int get_register (); + Searches for a register in the free state. If it doesn't find it, + searches for a freeable register. Sets the status to moveable. + Looking for a 'free' register before a freeable one should allow for + removing a few redundant loads (though I'm still unsure if such + things should be delegated entirely to the peephole pass). + + int get_register_force (int reg); + Returns 'reg' if it is free or freeable. If it is moveable, it moves it + to another free or freeable register. + Sets the status of 'reg' to busy. + + void set_register_freeable (int reg); + Sets the status of 'reg' to freeable. + + void set_register_free (int reg); + Sets the status of 'reg' to free. + + void will_clobber (int reg); + Spills the register to the stack. Sets the status to freeable. + After the clobbering has occurred, set the status to free. + + void register_unspill (int reg); + Un-spills register reg and sets the status to moveable. + + FIXME: how is the 'local' information represented? Maybe a MonoInst* pointer. + + Note: the register allocator will insert instructions in the basic block + during it's operation. + +* Examples + + Given the tree (on x86 the right argument to shl needs to be in ecx): + + store (local1, shl (local1, call (some_arg))) + + At the start of the basic block, the registers are set to the free state. + The sequence of instructions may be: + instruction register status -> [%eax %ecx %edx] + start free free free + eax = load local1 mov free free + /* call clobbers eax, ecx, edx */ + spill eax free free free + call mov free free + /* now eax contains the right operand of the shl */ + mov %eax -> %ecx free busy free + un-spill mov busy free + shl %cl, %eax mov free free + + The resulting x86 code is: + mov $fffc(%ebp), %eax + mov %eax, $fff0(%ebp) + push some_arg + call func + mov %eax, %ecx + mov $fff0(%ebp), %eax + shl %cl, %eax + + Note that since shl could operate directly on memory, we could have: + + push some_arg + call func + mov %eax, %ecx + shl %cl, $fffc(%ebp) + + The above example with loading the operand in a register is just to complicate + the example and show that the algorithm should be able to handle it. + + Let's take another example with the this-call call convention (the first argument + is passed in %ecx). + In this case, will_clobber() will be called only on %eax and %edx, while %ecx + will be allocated with get_register_force (). + Note: when a register is allocated with get_register_force(), it should be set + to a different state as soon as possible. + + store (local1, shl (local1, this-call (local1))) + + instruction register status -> [%eax %ecx %edx] + start free free free + eax = load local1 mov free free + /* force load in %ecx */ + ecx = load local1 mov busy free + spill eax free busy free + call mov free free + /* now eax contains the right operand of the shl */ + mov %eax -> %ecx free busy free + un-spill mov busy free + shl %cl, %eax mov free free + + What happens when a register that we need to allocate with get_register_force () + contains an operand for the next instruction? + + instruction register status -> [%eax %ecx %edx] + eax = load local0 mov free free + ecx = load local1 mov mov free + get_register_force (ecx) here. + We have two options: + mov %ecx, %edx + or: + spill %ecx + The first option is way better (and allows the peephole pass to + just load the value in %edx directly, instead of loading first to %ecx). + This doesn't work, though, if the instruction clobbers the %edx register + (like in a this-call). So, we first need to clobber the registers + (so the state of %ecx changes to freebale and there is no issue + with get_register_force ()). + What if an instruction both clobbers a register and requires it as + an operand? Lets' take the x86 idiv instruction as an example: it + requires the dividend in edx:eax and returns the result in eax, + with the modulus in edx. + + store (local1, div (local1, local2)) + + instruction register status -> [%eax %ecx %edx] + eax = load local0 mov free free + will_clobber eax, edx free mov free + force mov %ecx, %eax busy free free + set %edx busy free busy + idiv mov free free + + Note: edx is set to free after idiv, because the modulus is not needed + (if it was a rem, eax would have been freed). + If we load the divisor before will_clobber(), we'll have to spill + eax and reload it later. If we load it just after the idiv, there is no issue. + In any case, the algorithm should give the correct results and allow the operation. + + Working recursively on the isntructions there shouldn't be huge issues + with this algorithm (though, of course, it's not optimal and it may + introduce excessive spills or register moves). The advantage over the current + local reg allocator is that: + 1) the number of spills/moves would be smaller anyway + 2) a separate peephole pass could be able to eliminate reg moves + 3) we'll be able to remove the 'forced' spills we currently do with + the return value of method calls + +* Issues + + How to best integrate such a reg allocator with the burg stuff. + + Think about a call os sparc with two arguments: they got into %o0 and %o1 + and each of them sets the register as busy. But what if the values to put there + are themselves the result of a call? %o0 is no problem, but for all the + next argument n the above algorithm would spill all the 0...n-1 registers... + +* Papers + + More complex solutions to the local register allocator problem: + http://dimacs.rutgers.edu/TechnicalReports/abstracts/1997/97-33.html + + Combining register allocation and instruction scheduling: + http://citeseer.nj.nec.com/motwani95combining.html + + More on LRA euristics: + http://citeseer.nj.nec.com/liberatore97hardness.html + + Linear-time optimal code scheduling for delayedload architectures + http://www.cs.wisc.edu/~fischer/cs701.f01/inst.sched.ps.gz + + Precise Register Allocation for Irregular Architectures + http://citeseer.nj.nec.com/kong98precise.html + + Allocate registers first to subtrees that need more of them. + http://www.upb.de/cs/ag-kastens/compii/folien/comment401-409.2.pdf |