docs/precise-gc


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77

While the Boehm GC is quite good, we need to move to a 
precise, generational GC for better performance and smaller
memory usage (no false-positives memory retentions with big
allocations).

This is a large task, but it can be done in steps.

1) use the GCJ support to mark reference fields in objects, so
scanning the heap is faster. This is mostly done already, needs
checking that it is always used correctly (big objects, arrays).
There are also some data structures we use in the runtime that are
currently untyped that are allocated in the Gc heap and used to 
keep references to GC objects. We need to make them typed as to
precisely track GC references or make them non-GC memory,
by using more the GC hnadle support code (MonoGHashTable, MonoDomain, 
etc).

2) don't include in the static roots the .bss and .data segments
to save in scanning time and limit false-positives. This is mostly 
done already.

3) keep track precisely of stack locations and registers in native
code generation. This basically requires the regalloc rewrite code 
first, if we don't want to duplicate much of it. This is the hardest 
task of all, since proving it's correctness is very hard. Some tricks,
like having a build that injects GC.Collect() after every few simple 
operations may help. We also need to decide if we want to handle safe 
points at calls and back jumps only or at every instruction. The latter
case is harder to implement and requires we keep around much more data
(it potentially makes for faster stop-the-world phases).
The first case requires us to be able to advance a thread until it 
reaches the next safe point: this can be done with the same techniques 
used by a debugger. We already need something like this to handle
safely aborts happening in the middle of a prolog in managed code, 
for example, so this could be an additional sub-task that can be done
separately from the GC work.
Note that we can adapt the libgc code to use the info we collect
when scanning the stack in managed methods and still use the conservative
approach for the unmanaged stack, until we have our own collector,
which requires we define a proper icall interface to switch from managed 
to unmanaged code (hwo to we handle object references in the icall 
implementations, for example).

4) we could make use of the generational capabilities of the 
Boehm GC, but not with the current method involving signals which
may create incompatibilities and is not supported on all platforms.
We need to start using write barriers: they will be required anyway
for the generational GC we'll use. When a field holding a reference
is changed in an object (or an item in an array), we mark the card
or page where the field is stored as dirty. Later, when a collection 
is run, only objects in pages marked as dirty are scanned for
references instead of the whole heap. This could take a few days to
implement and probably much more time to debug if all the cases were 
not catched:-)

5) actually write the new generational and precise collector. There are
several examples out there as open source projects, though the CLR
needs some specific semantics so the code needs to be written from 
scratch anyway. Compared to item 3 this is relatively easer and it can
be tested outside of mono, too, until mono is ready to use it.
The important features needed:
*) precise, so there is no false positive memory retention
*) generational to reduce collection times
*) pointer-hopping allocation to reduce alloc time
*) possibly per-thread lock-free allocation
*) handle weakrefs and finalizers with the CLR semantics

Note: some GC engines use a single mmap area, because it makes
handling generations and the implementation much easier, but this also 
limits the expension of the heap, so people may need to use a command-line
option to set the max heap size etc. It would be better to have a design 
that allows mmapping a few megabytes chunks at a time.

The different tasks can be done in parallel. 1, 2 and 4 can be done in time
for the mono 1.2 release. Parts of 3 and 5 could be done as well.
The complete switch is supposed to happen with the mono 2.0 release.