1 files changed, 208 insertions, 0 deletions
diff --git a/docs/local-regalloc.txt b/docs/local-regalloc.txt
new file mode 100644
index 00000000000..a6e523557fe
--- /dev/null
+++ b/docs/local-regalloc.txt
@@ -0,0 +1,208 @@
+
+* Proposal for the local register allocator
+
+	The local register allocator deals with allocating registers
+	for temporaries inside a single basic block, while the global 
+	register allocator is concerned with method-wide allocation of 
+	variables.
+	The global register allocator uses callee-saved register for it's 
+	purpouse so that there is no need to save and restore these registers
+	at call sites.
+
+	There are a number of issues the local allocator needs to deal with:
+	*) some instructions expect operands in specific registers (for example
+		the shl instruction on x86, or the call instruction with thiscall
+		convention, or the equivalent call instructions on other architectures, 
+		such as the need to put output registers in %oX on sparc)
+	*) some instructions deliver results only in specific registers (for example
+		the div instruction on x86, or the call instructionson on almost all
+		the architectures).
+	*) it needs to know what registers may be clobbered by an instruction
+		(such as in a method call)
+	*) it should avoid excessive reloads or stores to improve performance
+	
+	While which specific instructions have limitations is architecture-dependent,
+	the problem shold be solved in an arch-independent way to reduce code duplication.
+	The register allocator will be 'driven' by the arch-dependent code, but it's 
+	implementation should be arch-independent.
+
+	To improve the current local register allocator, we need to
+	keep more state in it than the current setup that only keeps busy/free info.
+
+	Possible state information is:
+
+	free: the resgister is free to use and it doesn't contain useful info
+	freeable: the register contains data loaded from a local (there is 
+		also info about _which_ local it contains) as a result from previous
+		instructions (like, there was a store from the register to the local)
+	moveable: it contains live data that is needed in a following instruction, but
+		the contents may be moved to a different register
+	busy: the register contains live data and it is placed there because
+		the following instructions need it exactly in that register
+	allocated: the register is used by the global allocator
+
+	The local register allocator will have the following interfaces:
+
+	int get_register ();
+		Searches for a register in the free state. If it doesn't find it,
+		searches for a freeable register. Sets the status to moveable.
+		Looking for a 'free' register before a freeable one should allow for
+		removing a few redundant loads (though I'm still unsure if such
+		things should be delegated entirely to the peephole pass).
+	
+	int get_register_force (int reg);
+		Returns 'reg' if it is free or freeable. If it is moveable, it moves it 
+		to another free or freeable register.
+		Sets the status of 'reg' to busy.
+	
+	void set_register_freeable (int reg);
+		Sets the status of 'reg' to freeable.
+	
+	void set_register_free (int reg);
+		Sets the status of 'reg' to free.
+
+	void will_clobber (int reg);
+		Spills the register to the stack. Sets the status to freeable.
+		After the clobbering has occurred, set the status to free.
+
+	void register_unspill (int reg);
+		Un-spills register reg and sets the status to moveable.
+
+	FIXME: how is the 'local' information represented? Maybe a MonoInst* pointer.
+
+	Note: the register allocator will insert instructions in the basic block
+	during it's operation.
+
+* Examples
+
+	Given the tree (on x86 the right argument to shl needs to be in ecx):
+
+	store (local1, shl (local1, call (some_arg)))
+
+	At the start of the basic block, the registers are set to the free state.
+	The sequence of instructions may be:
+		instruction		register status -> [%eax %ecx %edx]
+		start                                       free free free
+		eax = load local1                           mov  free free
+		/* call clobbers eax, ecx, edx */
+		spill eax                                   free free free
+		call                                        mov  free free
+		/* now eax contains the right operand of the shl */
+		mov %eax -> %ecx                            free busy free
+		un-spill                                    mov  busy free
+		shl %cl, %eax                               mov  free free
+	
+	The resulting x86 code is:
+		mov $fffc(%ebp), %eax
+		mov %eax, $fff0(%ebp)
+		push some_arg
+		call func
+		mov %eax, %ecx
+		mov $fff0(%ebp), %eax
+		shl %cl, %eax
+		
+	Note that since shl could operate directly on memory, we could have:
+	
+		push some_arg
+		call func
+		mov %eax, %ecx
+		shl %cl, $fffc(%ebp)
+
+	The above example with loading the operand in a register is just to complicate
+	the example and show that the algorithm should be able to handle it.
+
+	Let's take another example with the this-call call convention (the first argument 
+	is passed in %ecx).
+	In this case, will_clobber() will be called only on %eax and %edx, while %ecx
+	will be allocated with get_register_force ().
+	Note: when a register is allocated with get_register_force(), it should be set
+	to a different state as soon as possible.
+
+	store (local1, shl (local1, this-call (local1)))
+
+		instruction		register status -> [%eax %ecx %edx]
+		start                                       free free free
+		eax = load local1                           mov  free free
+		/* force load in %ecx */
+		ecx = load local1                           mov  busy free
+		spill eax                                   free busy free
+		call                                        mov  free free
+		/* now eax contains the right operand of the shl */
+		mov %eax -> %ecx                            free busy free
+		un-spill                                    mov  busy free
+		shl %cl, %eax                               mov  free free
+
+	What happens when a register that we need to allocate with get_register_force ()
+	contains an operand for the next instruction?
+
+		instruction		register status -> [%eax %ecx %edx]
+		eax = load local0                           mov  free free
+		ecx = load local1                           mov  mov  free
+		get_register_force (ecx) here.
+		We have two options: 
+			mov %ecx, %edx
+		or:
+			spill %ecx
+		The first option is way better (and allows the peephole pass to
+		just load the value in %edx directly, instead of loading first to %ecx).
+		This doesn't work, though, if the instruction clobbers the %edx register
+		(like in a this-call). So, we first need to clobber the registers
+		(so the state of %ecx changes to freebale and there is no issue
+		with get_register_force ()).
+		What if an instruction both clobbers a register and requires it as 
+		an operand? Lets' take the x86 idiv instruction as an example: it
+		requires the dividend in edx:eax and returns the result in eax,
+		with the modulus in edx.
+	
+	store (local1, div (local1, local2))
+		
+		instruction		register status -> [%eax %ecx %edx]
+		eax = load local0                           mov  free free
+		will_clobber eax, edx                       free mov  free
+		force mov %ecx, %eax                        busy free free
+		set %edx                                    busy free busy
+		idiv                                        mov  free free
+	
+	Note: edx is set to free after idiv, because the modulus is not needed
+	(if it was a rem, eax would have been freed).
+	If we load the divisor before will_clobber(), we'll have to spill
+	eax and reload it later. If we load it just after the idiv, there is no issue.
+	In any case, the algorithm should give the correct results and allow the operation.
+		
+	Working recursively on the isntructions there shouldn't be huge issues
+	with this algorithm (though, of course, it's not optimal and it may
+	introduce excessive spills or register moves). The advantage over the current
+	local reg allocator is that:
+	1) the number of spills/moves would be smaller anyway
+	2) a separate peephole pass could be able to eliminate reg moves
+	3) we'll be able to remove the 'forced' spills we currently do with
+		the return value of method calls
+
+* Issues
+
+	How to best integrate such a reg allocator with the burg stuff.
+
+	Think about a call os sparc with two arguments: they got into %o0 and %o1
+	and each of them sets the register as busy. But what if the values to put there
+	are themselves the result of a call? %o0 is no problem, but for all the 
+	next argument n the above algorithm would spill all the 0...n-1 registers...
+
+* Papers
+
+	More complex solutions to the local register allocator problem:
+	http://dimacs.rutgers.edu/TechnicalReports/abstracts/1997/97-33.html
+
+	Combining register allocation and instruction scheduling:
+	http://citeseer.nj.nec.com/motwani95combining.html
+
+	More on LRA euristics:
+	http://citeseer.nj.nec.com/liberatore97hardness.html
+
+	Linear-time optimal code scheduling for delayedload architectures
+	http://www.cs.wisc.edu/~fischer/cs701.f01/inst.sched.ps.gz
+
+	Precise Register Allocation for Irregular Architectures
+	http://citeseer.nj.nec.com/kong98precise.html
+
+	Allocate registers first to subtrees that need more of them.
+	http://www.upb.de/cs/ag-kastens/compii/folien/comment401-409.2.pdf