gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback Machine copies. A new forum can be found here.

DS development > which is faster, swiCopy or dmaCopy??

#129188 - odelot - Sun May 20, 2007 7:09 am

?? what are the differences between them??

#129189 - LiraNuna - Sun May 20, 2007 7:44 am

swiCopy is a BIOS call. It stops CPU execution and IRQs. One of its modes (swiFastCopy) is believed to have a bug, though - the first 3/4 of the N bytes will be copied fast, while the last 1/4 will be copied very slowly.

DMA copy is done by an external controller, but the CPU is still halted (and IRQs held off) while the transfer runs.

I remember dsboi doing a speed test to see which of the copy methods is fastest: an ASM copy using 32-bit ldmia writes, memcpy, DMA copy in 32-bit mode, and swiCopy (plus swiFastCopy).

Those were the results (I don't have the actual numbers)

1) ldmia copy
2) memcpy
3) DMA copy
4) swiCopy
5) swiFastCopy

I'll have to find that benchmark .nds...
_________________
Private property.
Violators will be shot, survivors will be shot again.

#129748 - Mushu - Sat May 26, 2007 1:03 am

Hrm, I remember reading about that benchmark test, but the page I read it on seemed to be somewhat dated. Happen to know what version (or around what version) of libnds they were compiled with?

#130005 - wintermute - Wed May 30, 2007 1:42 am

The libnds version isn't at all relevant.

The gcc version *may* have some bearing on the speed but not much. The only function used for testing written in C was memcpy IIRC.
_________________
devkitPro - professional toolchains at amateur prices
devkitPro IRC support
Personal Blog

#130014 - Mushu - Wed May 30, 2007 4:50 am

Oh, for some reason I thought the "bug" in swiFastCopy was a bug in libnds, rather than a malfunction in the hardware implementation. Thanks for the clarification.

#130319 - Mushu - Sat Jun 02, 2007 5:44 am

Sorry to double post, but I'm a huge attention whore, and I wanted to get some attention :>

Anyway, I've never really been satisfied with the benchmarks done on memcpy/swiCopy/dmaCopy, simply because they all seem to show that memcpy is faster than dmaCopy. dmaCopy is a memory transfer with a dedicated controller, so why the hell would a software memcpy beat it?

Additionally, all of the benchmarks I've ever seen only gave a single number for each of the functions, rather than testing over a variety of transfer sizes and locations (main memory vs. VRAM vs. shared memory, etc). So I decided to hunker down and do some testing of my own, and the results were... informative.

I wrote a small app (source+binary to follow) to test all power-of-two block sizes from 16 to 1048576 bytes, using either main memory or VRAM as the source or destination, with main memory on the other end (I didn't want to end up with the situation where the source and destination are the same address, so only one of them is allowed to be VRAM).

Additionally, I mapped all the VRAM to 0x6000000, assuming that it will perform the same regardless of the mapping (which makes sense to me, but may not be the case).
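[For reference, the shape of such a sweep can be sketched on a host machine like this - a hypothetical harness, not the actual app: the real thing uses the DS hardware timers and VRAM banks, while here clock() and malloc stand in for them.]

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Sweep every power-of-two block size from 16 B up to max_bytes,
   timing one memcpy per size. Returns the number of sizes tested.
   clock() stands in for the DS hardware timers here. */
static int run_block_size_sweep(size_t max_bytes) {
    unsigned char *src = malloc(max_bytes), *dst = malloc(max_bytes);
    int samples = 0;
    if (!src || !dst) { free(src); free(dst); return 0; }
    memset(src, 0xAB, max_bytes);
    for (size_t size = 16; size <= max_bytes; size <<= 1) {
        clock_t t0 = clock();
        memcpy(dst, src, size);
        printf("%8zu bytes: %ld ticks\n", size, (long)(clock() - t0));
        samples++;
    }
    free(src);
    free(dst);
    return samples;
}
```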

Oh god I have to make a table. These values were obtained with the binary (to follow) on my DS Lite; there's probably at least 5 bugs in the source, so I wouldn't call them very conclusive, but they definitely do suggest conclusions -

Code:

Main Memory to Main Memory

          16B  32B 64B 128B  256B  512B 1024B 2048B
memcpy    41   52  73  115   312   731  1219  3906
dmaCopy   94  158 286  542  1054  2118  4148  8244
swiCopy  123  167 221  383   770  1440  2716  6691

          4096B  8192B 16384B  32768B  65536B
memcpy   11063  23697  48677   98291  196727
dmaCopy  16481  32887  65655  131263  262338
swiCopy  17789  38230  77787  155769  311512


etc, you get the picture. This matches what I've read so far, memcpy > dmaCopy > swiCopy, which is expected. When you switch over to VRAM, however, a different picture emerges -

Code:

Main Memory to VRAM

          16B  32B  64B 128B  256B  512B 1024B 2048B
dmaCopy   48   56   72  104   168   324   592  1145
memcpy    58   85  139  247   463   939  1891  4001
swiCopy  168  204  360  598  1118  2124  4212  8456

          4096B  8192B 16384B  32768B  65536B
dmaCopy   2226   4375   8627   17125   34121
memcpy    8711  18657  38657   77907  155903
swiCopy  17828  37827  76815  153746  307480


Anyway, so it appears that dmaCopy is significantly faster for transferring any size of data to VRAM, while memcpy is faster when working only in main memory. I would suspect that the dma hardware doesn't actually kick in unless you're transferring to VRAM, but I have no foundation for that conclusion.

I'm sure a lot of people already knew about this (or I've made a mistake somewhere ._.) but I always seem to have read that memcpy is faster than everything else etc, and just thought that was odd.

Blarg.

Sauce + Binary
Just the Binary

:<

Oh, and controls for the thing -
* A: Double the amount of memory allocated.
* B: Halve the amount of memory allocated.
* X: Toggle destination (VRAM/Main Memory)
* Y: Toggle source (VRAM/Main Memory)

#130323 - mml - Sat Jun 02, 2007 8:20 am

I think the larger part of the DMA controller's benefit is being able to copy asynchronously (for some value thereof, anyway). But this benefit is lost when using dmaCopy(), since it spins on the control register and only returns once the copy is actually complete. Of course if you're courageous and use dmaCopyAsynch() or its variants, you then have to manage the synchronisation yourself, which in most cases is probably more pain in the arse than it's worth...

#130334 - simonjhall - Sat Jun 02, 2007 10:12 am

Yo, what are the units in that table? Good work anyway :-)

One thing I'm interested in is how exactly does the copy loop in memcpy work? Does it copy in word quantities until it has less than a word of memory to go, or does it do the whole transfer by just copying bytes?
I bet you could write a really fast asm copy assuming that the source + dest are 16/32-bit aligned and the size is some multiple of something...

Reckon you could also do the test on main memory to the uncached memory map? memcpy is probably faster to main memory because it uses the cache, but dmaCopy and swiCopy don't. To make it uncached, OR your addresses with 0x400000.
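[That trick is just address arithmetic - main RAM on the ARM9 is mapped at 0x02000000 and mirrored uncached at 0x02400000, so ORing in 0x00400000 yields the uncached alias. The helper name here is made up:]

```c
#include <stdint.h>

/* Uncached mirror of a main-RAM address on the DS ARM9: main RAM
   sits at 0x02000000 with an uncached mirror at 0x02400000, so the
   alias is one OR away. Hypothetical helper name, sketch only. */
static inline uintptr_t uncached_alias(uintptr_t addr) {
    return addr | 0x00400000u;
}
```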
_________________
Big thanks to everyone who donated for Quake2

#130340 - wintermute - Sat Jun 02, 2007 10:47 am

memcpy source from newlib

Code:

/*
FUNCTION
        <<memcpy>>---copy memory regions

ANSI_SYNOPSIS
        #include <string.h>
        void* memcpy(void *<[out]>, const void *<[in]>, size_t <[n]>);

TRAD_SYNOPSIS
        void *memcpy(<[out]>, <[in]>, <[n]>
        void *<[out]>;
        void *<[in]>;
        size_t <[n]>;

DESCRIPTION
        This function copies <[n]> bytes from the memory region
        pointed to by <[in]> to the memory region pointed to by
        <[out]>.

        If the regions overlap, the behavior is undefined.

RETURNS
        <<memcpy>> returns a pointer to the first byte of the <[out]>
        region.

PORTABILITY
<<memcpy>> is ANSI C.

<<memcpy>> requires no supporting OS subroutines.

QUICKREF
        memcpy ansi pure
   */

#include <_ansi.h>
#include <stddef.h>
#include <limits.h>

/* Nonzero if either X or Y is not aligned on a "long" boundary.  */
#define UNALIGNED(X, Y) \
  (((long)X & (sizeof (long) - 1)) | ((long)Y & (sizeof (long) - 1)))

/* How many bytes are copied each iteration of the 4X unrolled loop.  */
#define BIGBLOCKSIZE    (sizeof (long) << 2)

/* How many bytes are copied each iteration of the word copy loop.  */
#define LITTLEBLOCKSIZE (sizeof (long))

/* Threshhold for punting to the byte copier.  */
#define TOO_SMALL(LEN)  ((LEN) < BIGBLOCKSIZE)

_PTR
_DEFUN (memcpy, (dst0, src0, len0),
   _PTR dst0 _AND
   _CONST _PTR src0 _AND
   size_t len0)
{
#if defined(PREFER_SIZE_OVER_SPEED) || defined(__OPTIMIZE_SIZE__)
  char *dst = (char *) dst0;
  char *src = (char *) src0;

  _PTR save = dst0;

  while (len0--)
    {
      *dst++ = *src++;
    }

  return save;
#else
  char *dst = dst0;
  _CONST char *src = src0;
  long *aligned_dst;
  _CONST long *aligned_src;
  int   len =  len0;

  /* If the size is small, or either SRC or DST is unaligned,
     then punt into the byte copy loop.  This should be rare.  */
  if (!TOO_SMALL(len) && !UNALIGNED (src, dst))
    {
      aligned_dst = (long*)dst;
      aligned_src = (long*)src;

      /* Copy 4X long words at a time if possible.  */
      while (len >= BIGBLOCKSIZE)
        {
          *aligned_dst++ = *aligned_src++;
          *aligned_dst++ = *aligned_src++;
          *aligned_dst++ = *aligned_src++;
          *aligned_dst++ = *aligned_src++;
          len -= BIGBLOCKSIZE;
        }

      /* Copy one long word at a time if possible.  */
      while (len >= LITTLEBLOCKSIZE)
        {
          *aligned_dst++ = *aligned_src++;
          len -= LITTLEBLOCKSIZE;
        }

       /* Pick up any residual with a byte copier.  */
      dst = (char*)aligned_dst;
      src = (char*)aligned_src;
    }

  while (len--)
    *dst++ = *src++;

  return dst0;
#endif /* not PREFER_SIZE_OVER_SPEED */
}


You can certainly write a fast asm copy like that, but it won't be a drop-in replacement for memcpy.

Why wouldn't swiCopy use the cache?

The most interesting thing about these tests that I noticed was this.

Code:

                        1K    2K
dmaCopy main to main  4148  8244
dmaCopy main to VRAM   592  1145


Notice anything there?
_________________
devkitPro - professional toolchains at amateur prices
devkitPro IRC support
Personal Blog

#130342 - Dark Knight ez - Sat Jun 02, 2007 11:16 am

You mean besides being 8 times more quick?
_________________
AmplituDS website

#130343 - NeX - Sat Jun 02, 2007 11:20 am

Does it really matter how quick it is when most people are using MicroSD cards, some of the slowest on the planet?

#130344 - wintermute - Sat Jun 02, 2007 11:35 am

Dark Knight ez wrote:
You mean besides being 8 times more quick?


Which suggests what?
_________________
devkitPro - professional toolchains at amateur prices
devkitPro IRC support
Personal Blog

#130345 - wintermute - Sat Jun 02, 2007 11:38 am

NeX wrote:
Does it really matter how quick it is when most people are using MicroSD cards, some of the slowest on the planet?


Most slot 2 cards load everything into gba cart space before executing the app.

Data doesn't necessarily have to be loaded from the filesystem *every* time you want to use it. That all depends on how well you've thought things out.
_________________
devkitPro - professional toolchains at amateur prices
devkitPro IRC support
Personal Blog

#130348 - Dark Knight ez - Sat Jun 02, 2007 1:00 pm

wintermute wrote:
Dark Knight ez wrote:
You mean besides being 8 times more quick?


Which suggests what?

Erm... that dmaCopy can copy entire blocks of 8*2bytes at one time to VRAM as opposed to just 2bytes at a time like memcpy does? Or am I missing something?
_________________
AmplituDS website

#130355 - Lick - Sat Jun 02, 2007 3:26 pm

What happens when you cache the VRAM? x_x
_________________
http://licklick.wordpress.com

#130359 - wintermute - Sat Jun 02, 2007 4:07 pm

Lick wrote:
What happens when you cache the VRAM? x_x


Your graphics break.

Not entirely sure what you're getting at either.

Cache would only affect the CPU copies, DMA doesn't go through the cache - this is why cache needs to be flushed before a DMA transfer when the source memory is cached and invalidated after if the destination is cached.
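[That flush/invalidate discipline looks roughly like this in libnds terms - DC_FlushRange(), DC_InvalidateRange(), and dmaCopy() are the real libnds calls; the stubs below exist only so this sketch compiles and runs off-target:]

```c
#include <string.h>

#ifndef ARM9
/* Host-only stand-ins so the sketch compiles off the DS;
   on target these come from libnds. */
static void DC_FlushRange(const void *base, unsigned long size) { (void)base; (void)size; }
static void DC_InvalidateRange(void *base, unsigned long size) { (void)base; (void)size; }
static void dmaCopy(const void *source, void *dest, unsigned long size) { memcpy(dest, source, size); }
#endif

/* Keep the data cache coherent around a DMA transfer: flush dirty
   source lines to RAM first (DMA reads RAM, not the cache), then
   invalidate stale destination lines afterwards. */
static void dmaCopyCoherent(const void *src, void *dst, unsigned long size) {
    DC_FlushRange(src, size);
    dmaCopy(src, dst, size);
    DC_InvalidateRange(dst, size);
}
```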
_________________
devkitPro - professional toolchains at amateur prices
devkitPro IRC support
Personal Blog


Last edited by wintermute on Sat Jun 02, 2007 4:13 pm; edited 1 time in total

#130360 - wintermute - Sat Jun 02, 2007 4:09 pm

Dark Knight ez wrote:

Erm... that dmaCopy can copy entire blocks of 8*2bytes at one time to VRAM as opposed to just 2bytes at a time like memcpy does? Or am I missing something?


I'm not really sure I follow your logic here.

The dma from main RAM to VRAM is approximately 8 times faster than from main RAM to main RAM. This implies that copying from main RAM to VRAM can copy 8 times as much data in the same time period. Nothing can be inferred about the amount of data transferred in a single chunk.

Assuming that VRAM to main RAM is the same speed then DMA from main to VRAM & back to main should theoretically be 4 times faster than main to main.

Right now I'm not sure why this should be the case. I'm open to suggestions.
_________________
devkitPro - professional toolchains at amateur prices
devkitPro IRC support
Personal Blog

#130378 - Cearn - Sat Jun 02, 2007 6:57 pm

This little bit from gbatek may be of interest:

From http://nocash.emubase.de/gbatek.htm#dsdmatransfers:
Quote:
NDS Sequential Main Memory DMA
Main RAM has different access time for sequential and non-sequential access. Normally DMA uses sequential access (except for the first word), however, if the source and destination addresses are both in Main RAM, then all accesses become non-sequential. In that case it would be faster to use two DMA transfers, one from Main RAM to a scratch buffer in WRAM, and one from WRAM to Main RAM.


Unfortunately, I can't find what the waitstates for DS sections are in gbatek :\

Also fun is adding FlushAll() right before the copies. This has quite a large effect on the results (except, of course, for DMA).

The comparisons aren't quite correct though. dmaCopy() uses the byte-size, not the halfword size, so the figures for DMA are inflated (or, rather, deflated). Also, why use halfword copies instead of word copies?
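[The two-transfer workaround gbatek describes above can be sketched like so - memcpy stands in for the two dmaCopy() legs, and the scratch size is an arbitrary stand-in for the WRAM buffer:]

```c
#include <string.h>
#include <stddef.h>

#define SCRATCH_BYTES 1024  /* arbitrary stand-in for the WRAM buffer size */

/* Main-RAM-to-main-RAM copy staged through a scratch buffer, so each
   leg has distinct source/destination regions and the DMA accesses
   stay sequential per gbatek. memcpy stands in for dmaCopy() here. */
static void stagedCopy(void *dst, const void *src, size_t len) {
    static unsigned char scratch[SCRATCH_BYTES];  /* would live in WRAM */
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (len) {
        size_t n = len < SCRATCH_BYTES ? len : SCRATCH_BYTES;
        memcpy(scratch, s, n);  /* leg 1: main RAM -> WRAM */
        memcpy(d, scratch, n);  /* leg 2: WRAM -> main RAM */
        s += n;
        d += n;
        len -= n;
    }
}
```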

#130384 - Mushu - Sat Jun 02, 2007 7:27 pm

mml wrote:
I think the larger part of the DMA controller's benefit is being able to copy asynchronously (for some value thereof, anyway). But this benefit is lost when using dmaCopy(), since it spins on the control register and only returns once the copy is actually complete. Of course if you're courageous and use dmaCopyAsynch() or its variants, you then have to manage the synchronisation yourself, which in most cases is probably more pain in the arse than it's worth...

Yeah, I thought the same thing at first too, since I figured there'd be four separate DMA controllers, each controlled by one of the registers. In my tests though, this query -

Code:
void query_quadDmaCopy() {
   PROF_START();
   int part_size = 1 << ( mem_size - 2 );
   dmaCopyWordsAsynch( 0, psrc + 0*part_size, pdst + 0*part_size, part_size );
   dmaCopyWordsAsynch( 1, psrc + 1*part_size, pdst + 1*part_size, part_size );
   dmaCopyWordsAsynch( 2, psrc + 2*part_size, pdst + 2*part_size, part_size );
   dmaCopyWordsAsynch( 3, psrc + 3*part_size, pdst + 3*part_size, part_size );
   while ( dmaBusy(0) || dmaBusy(1) || dmaBusy(2) || dmaBusy(3) );
   PROF_END( t_quadDmaCopy );
}


Always had a higher delay than the single DMA transfer. I'm not sure if this is a flaw in my understanding/implementation of the copy, but it raises doubts on the idea that there is a unique DMA controller for each channel. Might just be the same controller juggling all four channels, which kills any benefit you'd normally get from parallel copies. If that's the case though, dunno why they'd have four separate channels in the first place :<

Well. I guess it would be useful if you were loading things in the background (like changing tilesets without a loading screen!) D:

If you want the binary/source with the 4-channel DMA transfer (or just numbers) I can post those too.

simonjhall wrote:
Yo, what are the units in that table?

lol I actually have no idea - I just stole the timer code straight from TONC since I didn't want to have to figure out how to properly set up the timers (right frequency so they don't overflow, cascaded correctly, etc). It doesn't really matter what the units are for comparison purposes, since all of the tests use the same timer setup :3

<3 TONC.

Cearn wrote:
The comparisons aren't quite correct though. dmaCopy() uses the byte-size, not the halfword size, so the figures for DMA are inflated (or, rather, deflated). Also, why use halfword copies instead of word copies?

Whoooops. Changed the dma query function from -

Code:
void query_dmaCopy() {   
   PROF_START();
   dmaCopy( psrc, pdst, 1 << (mem_size-1) );
   PROF_END( t_dmaCopy );
}


to

Code:
void query_dmaCopy() {   
   PROF_START();
   dmaCopyWords( 3, psrc, pdst, 1 << (mem_size) );
   PROF_END( t_dmaCopy );
}


Which ended up adding on some time to the DMA transfer (hopefully I got it right this time). Actually, this makes a lot more sense, since it's basically spot-on with the 4-channel DMA transfer posted above. Lemmie upload updated binaries/source :<

Updated Binary/Source

And, might as well redo the tables~

Code:
Main Memory to Main Memory

          16B  32B 64B 128B  256B  512B 1024B 2048B
memcpy    41   52  73  115   312   731  1219  3906
dmaCopy  102  174 318  365  1182  2334  4638  9246
swiCopy  123  167 221  383   770  1440  2716  6691
4dmaCopy 143  215 359  647  1223  2397  4701  9309

          4096B  8192B 16384B  32768B  65536B
memcpy   11063  23697  48677   98291  196727
dmaCopy  18525  36960  73824  147624  295083
swiCopy  17789  38230  77787  155769  311512
4dmaCopy 18525  36957  73893  147625  295152

Main Memory to VRAM

          16B  32B  64B 128B  256B  512B 1024B 2048B
dmaCopy   53   65   89  137   233   432   830  1641
4dmaCopy 123  139  163  211   307   499   883  1679
memcpy    58   85  139  247   463   939  1891  4001
swiCopy  168  204  360  598  1118  2124  4212  8456

          4096B  8192B 16384B  32768B  65536B
dmaCopy   3256   6519  13022   25973   51950
4dmaCopy  3293   6449  12905   25941   51797
memcpy    8711  18657  38657   77907  155903
swiCopy  17828  37827  76815  153746  307480


:>

Ughhh, okay, wtf.
http://nocash.emubase.de/gbatek.htm#dmatransfers wrote:
The CPU is paused when DMA transfers are active, however, the CPU is operating during the periods when Sound/Blanking DMA transfers are paused.


So to test this, I edited out the while( dmaBusy etc ) loop in the 4-channel DMA query -

Code:
void query_quadDmaCopy() {
   PROF_START();
   int part_size = 1 << ( mem_size - 2 );
   dmaCopyWordsAsynch( 0, psrc + 0*part_size, pdst + 0*part_size, part_size );
   dmaCopyWordsAsynch( 1, psrc + 1*part_size, pdst + 1*part_size, part_size );
   dmaCopyWordsAsynch( 2, psrc + 2*part_size, pdst + 2*part_size, part_size );
   dmaCopyWordsAsynch( 3, psrc + 3*part_size, pdst + 3*part_size, part_size );
   // while ( dmaBusy(0) || dmaBusy(1) || dmaBusy(2) || dmaBusy(3) );
   PROF_END( t_quadDmaCopy );
}


It appears to work exactly the same on the hardware, whether or not that line is there, which suggests the DMA transfer might be blocking (and would also explain why this is approximately the same speed as dmaCopy).

If this is the case, then wtf are there 4 channels for? D:

(moar) auuuughhh wtf. There's actually an anomaly in here - when transferring 16384B from main memory to main memory (4KB on each DMA channel) the copy executes in 8350 units (compared to 49308 units taken by memcpy). I dunno lol :<

#130392 - tepples - Sat Jun 02, 2007 11:26 pm

Mushu wrote:
I'm not sure if this is a flaw in my understanding/implementation of the copy, but it raises doubts on the idea that there is a unique DMA controller for each channel. Might just be the same controller juggling all four channels, which kills any benefit you'd normally get from parallel copies. If that's the case though, dunno why they'd have four separate channels in the first place :<

On the GBA it was
0. Raster effect
1. Secondary raster effect or stereo PCM
2. PCM
3. Immediate copies

On the DS ARM9 it might be
0. Raster effect to main screen
1. Raster effect to sub screen
2. ???
3. Immediate copies

That's what the four channels are for: transfers in modes other than immediate.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.

#130394 - HyperHacker - Sat Jun 02, 2007 11:37 pm

Having 4 channels when only one can run at a time is still good because if one transfer is in progress/pending, you can simply schedule one on the next channel and it'll be done automatically when the previous ones finish. Even if only one runs at a time, you can still do 2 in parallel, one by DMA and one by CPU (or CPU can do other things during transfer).

It's interesting that the CPU is paused during DMA though, which kinda kills that advantage. The Game Boy is similar: during DMA only the upper 128 bytes of RAM are accessible. The CPU keeps going but is basically forced to spin in a timed loop in that small space. Here, the advantage is that DMA is faster than a CPU copy (slow CPU), but it sounds like on NDS, that only applies for main <--> VRAM.

So if I understand correctly (probably not) and this info is all correct, on the DS, DMA is only useful for VRAM access, sound, and HDMA.
_________________
I'm a PSP hacker now, but I still <3 DS.

#130415 - wintermute - Sun Jun 03, 2007 3:32 am

Cearn wrote:
This little bit from gbatek may be of interest:

From http://nocash.emubase.de/gbatek.htm#dsdmatransfers:
Quote:
NDS Sequential Main Memory DMA
Main RAM has different access time for sequential and non-sequential access. Normally DMA uses sequential access (except for the first word), however, if the source and destination addresses are both in Main RAM, then all accesses become non-sequential. In that case it would be faster to use two DMA transfers, one from Main RAM to a scratch buffer in WRAM, and one from WRAM to Main RAM.




Awesome, that makes perfect sense now.


Quote:

Also fun is adding FlushAll() right before the copies. This has quite a large effect on the results (except, of course, for DMA).


Uh, I really don't recommend doing that - DC_FlushRange() & DC_InvalidateRange() exist for a reason.


Quote:

The comparisons aren't quite correct though. dmaCopy() uses the byte-size, not the halfword size, so the figures for DMA are inflated (or, rather, deflated). Also, why use halfword copies instead of word copies?


You've lost me here. You mean memcpy uses byte size (blatantly it doesn't unless source & destination are misaligned)?
_________________
devkitPro - professional toolchains at amateur prices
devkitPro IRC support
Personal Blog

#130420 - DekuTree64 - Sun Jun 03, 2007 6:49 am

Wow, this is really interesting. I just tried it with an 8-register ldmia/stmia copy and it beats memcpy in pretty much all cases. Here is the function:
Code:
fastCopy:
@bail out for misaligned/0 size
tst r2, #31
bxne lr
cmp r2, #32
bxlt lr

stmfd sp!, {r4-r11, lr}

fastCopyLoop:
ldmia r1!, {r4-r11}
stmia r0!, {r4-r11}
subs r2, r2, #32
bgt fastCopyLoop

ldmfd sp!, {r4-r11, pc}
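
[For readers who don't speak ARM asm, the routine above behaves like this C sketch - the real speed comes from the 8-register ldmia/stmia bursts, which plain C won't reproduce:]

```c
#include <stdint.h>
#include <stddef.h>

/* C equivalent of fastCopy: bail out unless the byte count is a
   non-zero multiple of 32, then move eight 32-bit words per
   iteration, mirroring the ldmia/stmia pair. */
static void fastCopyC(uint32_t *dst, const uint32_t *src, size_t bytes) {
    if (bytes < 32 || (bytes & 31))
        return;
    do {
        dst[0] = src[0]; dst[1] = src[1]; dst[2] = src[2]; dst[3] = src[3];
        dst[4] = src[4]; dst[5] = src[5]; dst[6] = src[6]; dst[7] = src[7];
        dst += 8;
        src += 8;
        bytes -= 32;
    } while (bytes);
}
```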

I also added in the DC_FlushAll before each profile, so the small sizes aren't all just cache-to-cache. But what really surprised me is that if you align the source and dest to land on 32-byte boundaries (i.e. cache lines), it speeds it up by quite a lot.

Here are the results. fastCopy is my function, and fcAlign has source/dest aligned to 32-bytes.
Code:
Main Memory to Main Memory

          16B  32B 64B 128B  256B  512B 1024B 2048B
memcpy    157  217 313  505   889  1657  3193  6265
fastCopy  n/a  132 196  304   520   972  1816  3564
fcAlign   n/a  151 198  292   480   856  1608  3112
dmaCopy4  169  241 385  673  1249  2401  4705  9313

          4096B  8192B 16384B  32768B  65536B
memcpy    12409  24697  49273   98425  196729
fastCopy   7020  13912  27756   55384  110779
fcAlign    6120  12136  24168   48232  96439
dmaCopy4  18592  36961  73825  147625  295094

Main Memory to VRAM

          16B  32B  64B 128B  256B  512B 1024B 2048B
memcpy    163  212  288  440   744  1352  2568  5000
fastCopy  n/a  142  208  300   484   852  1588  3060
fcAlign   n/a  158  202  290   466   818  1522  2930
dmaCopy4  120  132  156  204   300   499   897  1698

          4096B  8192B 16384B  32768B  65536B
memcpy    9864   19592  39048   77960  155784
fastCopy  6004   11892  23648   47220   94322
fcAlign   5746   11378  22642   45170   90226
dmaCopy4  3312    6553  13035   25986   51974

_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku

#130459 - Noda - Sun Jun 03, 2007 8:05 pm

Why not add an ASM copy function like yours to libnds? Could be useful...

#134024 - HyperHacker - Tue Jul 10, 2007 3:53 am

Pardon the bump, but I don't think it's true that only one DMA channel can run at a time. I just timed loading 256x192 of a 512x512 bitmap from VRAM to main RAM. If I only use one DMA channel it takes 145ms, while if I use all 4 and spin on the last one, it finishes in a mere 32ms - about 22% as long, which is actually more than 4 times as fast.

Specifically, what I'm doing is calling dmaCopyWordsAsynch() for channels 0, 1 and 2 and dmaCopyWords() for channel 3. Each is copying 512 bytes at a time; 98304 bytes are copied in total.

What is worth noting, though, is that having all four DMA transfers be asynchronous and doing a CPU copy while they run didn't improve performance at all.
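
[Concretely, the transfer described is 192 rows of 512 bytes, each read from a 1024-byte-stride bitmap, with the channel rotating every row. A host sketch, with memcpy standing in for dmaCopyWordsAsynch()/dmaCopyWords() and the layout constants inferred from the post:]

```c
#include <string.h>
#include <stdint.h>

/* 256x192 window of a 512x512 16-bit bitmap: 512 bytes per row,
   source rows 1024 bytes apart, 192 * 512 = 98304 bytes in total. */
enum { ROWS = 192, ROW_BYTES = 512, SRC_STRIDE = 1024 };

/* One 512-byte copy per row, rotating over the four DMA channels
   (0-2 asynchronous, 3 blocking in the original post). memcpy
   stands in for the per-channel DMA call in this host sketch. */
static void copyRowsInterleaved(uint8_t *dst, const uint8_t *src) {
    for (int row = 0; row < ROWS; row++) {
        int channel = row & 3;
        (void)channel;  /* on hardware: issue on DMA channel `channel` */
        memcpy(dst + (size_t)row * ROW_BYTES,
               src + (size_t)row * SRC_STRIDE, ROW_BYTES);
    }
}
```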
_________________
I'm a PSP hacker now, but I still <3 DS.

#134031 - olimar - Tue Jul 10, 2007 5:59 am



Last edited by olimar on Wed Aug 20, 2008 10:47 pm; edited 1 time in total

#134038 - Ant6n - Tue Jul 10, 2007 7:13 am

olimar wrote:
...
If the CPU doesn't touch main ram (i.e. TCMs or cache), it's not stopped by DMA.

I thought as long as the CPU doesn't touch the bus everything is fine. Since TCM/cache live on the CPU, they don't stall. But then there'd be many more ways to stall the CPU, i.e. accessing video RAM, or shared RAM, or memory-mapped I/O, no?

#134057 - olimar - Tue Jul 10, 2007 11:46 am



Last edited by olimar on Wed Aug 20, 2008 10:47 pm; edited 1 time in total

#134060 - tepples - Tue Jul 10, 2007 11:51 am

To be specific, there are 3 kinds of "ROM" that need testing:
  • GBA ROM (which is not present if you use a SLOT-1 card)
  • DS Game Card I/O
  • DS BIOS

_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.