markspace: BlitzMax Garbage Collection part 1...

Wednesday, September 2, 2009

BlitzMax Garbage Collection part 1...

Hi,

Ok, a few random comments on the history and design of BlitzMax's garbage collection system - probably not for the faint of heart...

The GC system is BlitzMax has actually gone through several iterations by now. The first was, ahem, a little rough. It was basically a reference counting system that required the programmer to manually call a 'FlushMem' command every now and then. Worse than that, they couldn't just hide the FlushMem in a function somewhere, it had to be in the right place 'on the stack' as it couldn't 'flush' anything higher up the stack.

Yes, users found it a little confusing!

But it worked, and I eventually managed to get rid of the need for FlushMem entirely. This is the GC BlitzMax has used for a long time now and it has one very cool thing going for it: it's *fast*!

However, it has 2 fairly serious disadvantages - it can't handle 'cycles' of objects (this is where a chunk of memory directly or indirectly points to itself - such memory can 'leak') and it doesn't work nicely with multithreading. And EVERYONE wants multithreading these days, right?

So I decided it was time to implement a 'real' garbage collector - one that could handle both cycles and threading.

Fortunately, thanks to the wonders of open source software, there is an excellent GC available that is *almost* perfect for BlitzMax: the 'Boehm-Demers-Weiser' (BDWGC) conservative garbage collector:

http://www.hpl.hp.com/personal/Hans_Boehm/gc/

This is already in use by many pieces of software, so is well tested, and is pretty easy to 'drop into' an existing project such as BlitzMax.

And, best of it all - it just works! By using this in your software, you can allocate memory and forget about it. It is one seriously cool piece of programming...

However, there were a few things I was a little uncomfortable with about it:

* BDWGC was designed to handle any old generic code, yet I 'know' stuff about BlitzMax code that they don't - surely I could use this knowledge to make the GC more efficient? One example: C/C++ etc code can often generate 'derived' pointers. This is where a pointer points 'into the middle' of an object, not at the beginning. However, with BlitzMax code you don't have to worry about derived pointers (not coincidentally due to changes I had to make to get rid of FlushMem way back!) so the code in BDWGC to 'validate' a pointer is unnecessarily complex.

* There appears to be a bug in BDWGC to do with 'cyclic finalizers' which can cause crashes or leaks. No doubt, this will be fixed eventually, but hey, it's open source so surely I can fix it myself? Unfortunately…

* The BDWGC code is pretty hard to follow. It's designed to run on about a gazillion versions of EVERY OS ever, has a ton of optional features and is 'macro-ed' up the wazoo. After a few hours trying to hunt down the cyclic finalizer problem, I realized I was out of my depth and gave up (Actually, this may have more to do with me - I kind of suck at reading other people's code...)

So, the last effort on the GC front has been to write a 'proper' cycle/threading friendly GC on my own (MSGC), that can take advantage of the way BlitzMax works - and the results are looking good!

So far, MSGC is faster than BDWGC (although not as much as I would like!) with everything we've tested it with and appears to be very stable (although I'm sure the upcoming public release will change that).

It's been a pretty interesting experience writing it too, and I'll go into some of the details in a future posting, but that's probably enough for now.

Peace out!
Mark

7 comments:

Josh KlintSeptember 2, 2009 at 7:31 PM
How does the speed of the ST GC compare to your new one?
ReplyDelete
Replies
marksiblySeptember 2, 2009 at 7:43 PM
Hi,

The single threaded ref counting GC is still noticably faster than either of the threaded ones, and I suspect it always will be as it's much simpler.

The reason we can't use the ref counting GC for threaded apps is because the reference count updates are not atomic.

They can be made atomic by using the x86 LOCK prefix, but this slows down ref count updates pretty dramatically.

Not sure exactly how dramatically, but when I tried it several years back LOTS of stuff started losing frames. Besides, I'd rather work towards a proper GC That can handle cycles etc anyway.

Mark
ReplyDelete
Replies
The Five17September 3, 2009 at 1:23 PM
Hi Mark,

I'm extremely interested in BlitzMax but did hold back so far because of the lacking multithreading support.
One question: what is your opinion on functional programming? I'm not a crack (in fact hated it first) but if you get used to it's a very elegant way to solve *some* problems. Most funcional compilers also seem to scale well in terms of multithreading :)
ReplyDelete
Replies
marksiblySeptember 3, 2009 at 4:42 PM
Hi,

> What is your opinion on functional programming.

To be honest, I've never actually used it in a project so don't really feel qualified to have an opinion on it one way or another.

I read up on it a fair bit and think I more or less get it, but that's a *whole* lot different from having used it in real life.

Anyways, lots of dudes on Slashdot etc seem to think it's the bees knees, so it at least has beardy-cred going for it.

Mark
ReplyDelete
Replies
UnknownSeptember 4, 2009 at 8:59 AM
All I care about is speed. The performance hit of the multithreaded GC means BlitzMax threading is not useful to me for any performance-critical code. I'd prefer MT GC that didn't catch circular references, or selective GC that only works on some objects.

At this point it makes the most sense to offload my renderer into C++ and run it on multiple threads there (the culling stage can be multithreaded, and is the most CPU-intensive portion). However, this is a pretty big task and I don't know when it will happen.
ReplyDelete
Replies
marksiblySeptember 4, 2009 at 6:18 PM
Hi,

Any 'thread frindly' code is likely to have increased overhead compared to similar 'thread oblivious' code - there's just more work to be done. Even (as I pointed out above) the current non-cyclic, single threaded GC slows down significantly when made thread friendly.

But just what is the performance hit of the new GC? Have you tried to estimate/quantify it? What is the estimated performance benefit of threading? If the new GC 'costs' 10% speed but threading your app 'gains' 20%, then it's a win.

And long term, as we get more and more cores and learn better how to use them, the performance benefits of threading are just gonna explode, whereas the costs will stay the same.

Not saying this necessarily applies to your particular work...but I think by just stopping at 'the new GC is slower' you may be missing the bigger picture.

Mark
ReplyDelete
Replies
Harald ReisingerJanuary 6, 2010 at 4:37 AM
Hi Mark,

I recently found some interessting papers about GV from IBM, maybe there is something interesting four you there

best regards

Harry
ReplyDelete
Replies

Add comment

Note: Only a member of this blog may post a comment.