gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback Machine copies. A new forum can be found here.

DS development > which is faster, swiCopy or dmaCopy??

#129188 - odelot - Sun May 20, 2007 7:09 am

?? what are the differences between them??

#129189 - LiraNuna - Sun May 20, 2007 7:44 am

swiCopy is a BIOS call. It stops CPU execution and IRQs. One of its modes (swiFastCopy) is believed to have a bug, though - the first 3/4 of the N bytes will be copied fast, while the last 1/4 will be copied very slowly.

DMA copy is done by an external controller, but the CPU is still halted (and IRQs held off) while the transfer runs.

I remember dsboi doing a speed test to see which of the copy methods is fastest: an ASM copy using 32-bit ldmia writes, memcpy, DMA copy in 32-bit mode, and swiCopy (plus swiFastCopy).

Those were the results (I don't have the actual numbers)

1) ldmia copy
2) memcpy
3) DMA copy
4) swiCopy
5) swiFastCopy

I'll have to find that benchmark .nds...
_________________
Private property.
Violators will be shot, survivors will be shot again.

#129748 - Mushu - Sat May 26, 2007 1:03 am

Hrm, I remember reading about that benchmark test, but the page I read it on seemed to be somewhat dated. Happen to know what version (or around what version) of libnds they were compiled with?

#130005 - wintermute - Wed May 30, 2007 1:42 am

The libnds version isn't at all relevant.

The gcc version *may* have some bearing on the speed but not much. The only function used for testing written in C was memcpy IIRC.
_________________
devkitPro - professional toolchains at amateur prices
devkitPro IRC support
Personal Blog

#130014 - Mushu - Wed May 30, 2007 4:50 am

Oh, for some reason I thought the "bug" in swiFastCopy was a bug in libnds, rather than a malfunction in the hardware implementation. Thanks for the clarification.

#130319 - Mushu - Sat Jun 02, 2007 5:44 am

Sorry to double post, but I'm a huge attention whore, and I wanted to get some attention :>

Anyway, I've never really been satisfied with the benchmarks done on memcpy/swiCopy/dmaCopy, simply because they all seem to show that memcpy is faster than dmaCopy. dmaCopy is a memory transfer with a dedicated controller, so why the hell would a software memcpy beat it?

Additionally, all of the benchmarks I've ever seen only gave a single number for each of the functions, rather than testing over a variety of transfer sizes and locations (main memory vs. VRAM vs. shared memory, etc). So I decided to hunker down and do some testing of my own, and the results were... informative.

I wrote a small app (source+binary to follow) to test all power-of-two block sizes from 16 to 1048576 bytes, using either main memory or VRAM as the source or destination, with main memory on the other end (I didn't want to end up with the situation where the source and destination are the same address, so only one of them is allowed to be VRAM).

Additionally, I mapped all the VRAM to 0x6000000, assuming that it will perform the same regardless of the mapping (which makes sense to me, but may not be the case).
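[For reference, the shape of such a sweep can be sketched on a host machine like this - a hypothetical harness, not the actual app: the real thing uses the DS hardware timers and VRAM banks, while here clock() and malloc stand in for them.]

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Sweep every power-of-two block size from 16 B up to max_bytes,
   timing one memcpy per size. Returns the number of sizes tested.
   clock() stands in for the DS hardware timers here. */
static int run_block_size_sweep(size_t max_bytes) {
    unsigned char *src = malloc(max_bytes), *dst = malloc(max_bytes);
    int samples = 0;
    if (!src || !dst) { free(src); free(dst); return 0; }
    memset(src, 0xAB, max_bytes);
    for (size_t size = 16; size <= max_bytes; size <<= 1) {
        clock_t t0 = clock();
        memcpy(dst, src, size);
        printf("%8zu bytes: %ld ticks\n", size, (long)(clock() - t0));
        samples++;
    }
    free(src);
    free(dst);
    return samples;
}
```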

Oh god I have to make a table. These values were obtained with the binary (to follow) on my DS Lite; there's probably at least 5 bugs in the source, so I wouldn't call them very conclusive, but they definitely do suggest conclusions -

Code:

Main Memory to Main Memory

          16B  32B 64B 128B  256B  512B 1024B 2048B
memcpy    41   52  73  115   312   731  1219  3906
dmaCopy   94  158 286  542  1054  2118  4148  8244
swiCopy  123  167 221  383   770  1440  2716  6691

          4096B  8192B 16384B  32768B  65536B
memcpy   11063  23697  48677   98291  196727
dmaCopy  16481  32887  65655  131263  262338
swiCopy  17789  38230  77787  155769  311512


etc, you get the picture. This matches what I've read so far, memcpy > dmaCopy > swiCopy, which is expected. When you switch over to VRAM, however, a different picture emerges -

Code:

Main Memory to VRAM

          16B  32B  64B 128B  256B  512B 1024B 2048B
dmaCopy   48   56   72  104   168   324   592  1145
memcpy    58   85  139  247   463   939  1891  4001
swiCopy  168  204  360  598  1118  2124  4212  8456

          4096B  8192B 16384B  32768B  65536B
dmaCopy   2226   4375   8627   17125   34121
memcpy    8711  18657  38657   77907  155903
swiCopy  17828  37827  76815  153746  307480


Anyway, so it appears that dmaCopy is significantly faster for transferring any size of data to VRAM, while memcpy is faster when working only in main memory. I would suspect that the dma hardware doesn't actually kick in unless you're transferring to VRAM, but I have no foundation for that conclusion.

I'm sure a lot of people already knew about this (or I've made a mistake somewhere ._.) but I always seem to have read that memcpy is faster than everything else etc, and just thought that was odd.

Blarg.

Sauce + Binary
Just the Binary

:<

Oh, and controls for the thing -
* A: Double the amount of memory allocated.
* B: Halve the amount of memory allocated.
* X: Toggle destination (VRAM/Main Memory)
* Y: Toggle source (VRAM/Main Memory)

#130323 - mml - Sat Jun 02, 2007 8:20 am

I think the larger part of the DMA controller's benefit is being able to copy asynchronously (for some value thereof, anyway). But this benefit is lost when using dmaCopy(), since it spins on the control register and only returns once the copy is actually complete. Of course if you're courageous and use dmaCopyAsynch() or its variants, you then have to manage the synchronisation yourself, which in most cases is probably more pain in the arse than it's worth...

#130334 - simonjhall - Sat Jun 02, 2007 10:12 am

Yo, what are the units in that table? Good work anyway :-)

One thing I'm interested in is how exactly does the copy loop in memcpy work? Does it copy in word quantities until it has less than a word of memory to go, or does it do the whole transfer by just copying bytes?
I bet you could write a really fast asm copy assuming that the source + dest are 16/32-bit aligned and the size is some multiple of something...

Reckon you could also do the test on main memory to the uncached memory map? memcpy is probably faster to main memory because it uses the cache, but dmaCopy and swiCopy don't. To make it uncached, OR your addresses with 0x400000.
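[That trick is just address arithmetic - main RAM on the ARM9 is mapped at 0x02000000 and mirrored uncached at 0x02400000, so ORing in 0x00400000 yields the uncached alias. The helper name here is made up:]

```c
#include <stdint.h>

/* Uncached mirror of a main-RAM address on the DS ARM9: main RAM
   sits at 0x02000000 with an uncached mirror at 0x02400000, so the
   alias is one OR away. Hypothetical helper name, sketch only. */
static inline uintptr_t uncached_alias(uintptr_t addr) {
    return addr | 0x00400000u;
}
```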
_________________
Big thanks to everyone who donated for Quake2

#130340 - wintermute - Sat Jun 02, 2007 10:47 am

memcpy source from newlib

Code:

/*
FUNCTION
        <<memcpy>>---copy memory regions

ANSI_SYNOPSIS
        #include <string.h>
        void* memcpy(void *<[out]>, const void *<[in]>, size_t <[n]>);

TRAD_SYNOPSIS
        void *memcpy(<[out]>, <[in]>, <[n]>
        void *<[out]>;
        void *<[in]>;
        size_t <[n]>;

DESCRIPTION
        This function copies <[n]> bytes from the memory region
        pointed to by <[in]> to the memory region pointed to by
        <[out]>.

        If the regions overlap, the behavior is undefined.

RETURNS
        <<memcpy>> returns a pointer to the first byte of the <[out]>
        region.

PORTABILITY
<<memcpy>> is ANSI C.

<<memcpy>> requires no supporting OS subroutines.

QUICKREF
        memcpy ansi pure
   */

#include <_ansi.h>
#include <stddef.h>
#include <limits.h>

/* Nonzero if either X or Y is not aligned on a "long" boundary.  */
#define UNALIGNED(X, Y) \
  (((long)X & (sizeof (long) - 1)) | ((long)Y & (sizeof (long) - 1)))

/* How many bytes are copied each iteration of the 4X unrolled loop.  */
#define BIGBLOCKSIZE    (sizeof (long) << 2)

/* How many bytes are copied each iteration of the word copy loop.  */
#define LITTLEBLOCKSIZE (sizeof (long))

/* Threshhold for punting to the byte copier.  */
#define TOO_SMALL(LEN)  ((LEN) < BIGBLOCKSIZE)

_PTR
_DEFUN (memcpy, (dst0, src0, len0),
   _PTR dst0 _AND
   _CONST _PTR src0 _AND
   size_t len0)
{
#if defined(PREFER_SIZE_OVER_SPEED) || defined(__OPTIMIZE_SIZE__)
  char *dst = (char *) dst0;
  char *src = (char *) src0;

  _PTR save = dst0;

  while (len0--)
    {
      *dst++ = *src++;
    }

  return save;
#else
  char *dst = dst0;
  _CONST char *src = src0;
  long *aligned_dst;
  _CONST long *aligned_src;
  int   len =  len0;

  /* If the size is small, or either SRC or DST is unaligned,
     then punt into the byte copy loop.  This should be rare.  */
  if (!TOO_SMALL(len) && !UNALIGNED (src, dst))
    {
      aligned_dst = (long*)dst;
      aligned_src = (long*)src;

      /* Copy 4X long words at a time if possible.  */
      while (len >= BIGBLOCKSIZE)
        {
          *aligned_dst++ = *aligned_src++;
          *aligned_dst++ = *aligned_src++;
          *aligned_dst++ = *aligned_src++;
          *aligned_dst++ = *aligned_src++;
          len -= BIGBLOCKSIZE;
        }

      /* Copy one long word at a time if possible.  */
      while (len >= LITTLEBLOCKSIZE)
        {
          *aligned_dst++ = *aligned_src++;
          len -= LITTLEBLOCKSIZE;
        }

       /* Pick up any residual with a byte copier.  */
      dst = (char*)aligned_dst;
      src = (char*)aligned_src;
    }

  while (len--)
    *dst++ = *src++;

  return dst0;
#endif /* not PREFER_SIZE_OVER_SPEED */
}


You can certainly write a fast asm copy like that, but it won't be a drop-in replacement for memcpy.

Why wouldn't swiCopy use the cache?

The most interesting thing about these tests that I noticed was this.

Code:

                        1K    2K
dmaCopy main to main  4148  8244
dmaCopy main to VRAM   592  1145


Notice anything there?
_________________
devkitPro - professional toolchains at amateur prices
devkitPro IRC support
Personal Blog

#130342 - Dark Knight ez - Sat Jun 02, 2007 11:16 am

You mean besides being 8 times more quick?
_________________
AmplituDS website

#130343 - NeX - Sat Jun 02, 2007 11:20 am

Does it really matter how quick it is when most people are using MicroSD cards, some of the slowest on the planet?

#130344 - wintermute - Sat Jun 02, 2007 11:35 am

Dark Knight ez wrote:
You mean besides being 8 times more quick?


Which suggests what?
_________________
devkitPro - professional toolchains at amateur prices
devkitPro IRC support
Personal Blog

#130345 - wintermute - Sat Jun 02, 2007 11:38 am

NeX wrote:
Does it really matter how quick it is when most people are using MicroSD cards, some of the slowest on the planet?


Most slot 2 cards load everything into gba cart space before executing the app.

Data doesn't necessarily have to be loaded from the filesystem *every* time you want to use it. That all depends on how well you've thought things out.
_________________
devkitPro - professional toolchains at amateur prices
devkitPro IRC support
Personal Blog

#130348 - Dark Knight ez - Sat Jun 02, 2007 1:00 pm

wintermute wrote:
Dark Knight ez wrote:
You mean besides being 8 times more quick?


Which suggests what?

Erm... that dmaCopy can copy entire blocks of 8*2bytes at one time to VRAM as opposed to just 2bytes at a time like memcpy does? Or am I missing something?
_________________
AmplituDS website

#130355 - Lick - Sat Jun 02, 2007 3:26 pm

What happens when you cache the VRAM? x_x
_________________
http://licklick.wordpress.com

#130359 - wintermute - Sat Jun 02, 2007 4:07 pm

Lick wrote:
What happens when you cache the VRAM? x_x


Your graphics break.

Not entirely sure what you're getting at either.

Cache would only affect the CPU copies, DMA doesn't go through the cache - this is why cache needs to be flushed before a DMA transfer when the source memory is cached and invalidated after if the destination is cached.
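[That flush/invalidate discipline looks roughly like this in libnds terms - DC_FlushRange(), DC_InvalidateRange(), and dmaCopy() are the real libnds calls; the stubs below exist only so this sketch compiles and runs off-target:]

```c
#include <string.h>

#ifndef ARM9
/* Host-only stand-ins so the sketch compiles off the DS;
   on target these come from libnds. */
static void DC_FlushRange(const void *base, unsigned long size) { (void)base; (void)size; }
static void DC_InvalidateRange(void *base, unsigned long size) { (void)base; (void)size; }
static void dmaCopy(const void *source, void *dest, unsigned long size) { memcpy(dest, source, size); }
#endif

/* Keep the data cache coherent around a DMA transfer: flush dirty
   source lines to RAM first (DMA reads RAM, not the cache), then
   invalidate stale destination lines afterwards. */
static void dmaCopyCoherent(const void *src, void *dst, unsigned long size) {
    DC_FlushRange(src, size);
    dmaCopy(src, dst, size);
    DC_InvalidateRange(dst, size);
}
```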
_________________
devkitPro - professional toolchains at amateur prices
devkitPro IRC support
Personal Blog


Last edited by wintermute on Sat Jun 02, 2007 4:13 pm; edited 1 time in total

#130360 - wintermute - Sat Jun 02, 2007 4:09 pm

Dark Knight ez wrote:

Erm... that dmaCopy can copy entire blocks of 8*2bytes at one time to VRAM as opposed to just 2bytes at a time like memcpy does? Or am I missing something?


I'm not really sure I follow your logic here.

The dma from main RAM to VRAM is approximately 8 times faster than from main RAM to main RAM. This implies that copying from main RAM to VRAM can copy 8 times as much data in the same time period. Nothing can be inferred about the amount of data transferred in a single chunk.

Assuming that VRAM to main RAM is the same speed then DMA from main to VRAM & back to main should theoretically be 4 times faster than main to main.

Right now I'm not sure why this should be the case. I'm open to suggestions.
_________________
devkitPro - professional toolchains at amateur prices
devkitPro IRC support
Personal Blog

#130378 - Cearn - Sat Jun 02, 2007 6:57 pm

This little bit from gbatek may be of interest:

From http://nocash.emubase.de/gbatek.htm#dsdmatransfers:
Quote:
NDS Sequential Main Memory DMA
Main RAM has different access time for sequential and non-sequential access. Normally DMA uses sequential access (except for the first word), however, if the source and destination addresses are both in Main RAM, then all accesses become non-sequential. In that case it would be faster to use two DMA transfers, one from Main RAM to a scratch buffer in WRAM, and one from WRAM to Main RAM.


Unfortunately, I can't find what the waitstates for DS sections are in gbatek :\

Also fun is adding FlushAll() right before the copies. This has quite a large effect on the results (except, of course, for DMA).

The comparisons aren't quite correct though. dmaCopy() uses the byte-size, not the halfword size, so the figures for DMA are inflated (or, rather, deflated). Also, why use halfword copies instead of word copies?
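[The two-transfer workaround gbatek describes above can be sketched like so - memcpy stands in for the two dmaCopy() legs, and the scratch size is an arbitrary stand-in for the WRAM buffer:]

```c
#include <string.h>
#include <stddef.h>

#define SCRATCH_BYTES 1024  /* arbitrary stand-in for the WRAM buffer size */

/* Main-RAM-to-main-RAM copy staged through a scratch buffer, so each
   leg has distinct source/destination regions and the DMA accesses
   stay sequential per gbatek. memcpy stands in for dmaCopy() here. */
static void stagedCopy(void *dst, const void *src, size_t len) {
    static unsigned char scratch[SCRATCH_BYTES];  /* would live in WRAM */
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (len) {
        size_t n = len < SCRATCH_BYTES ? len : SCRATCH_BYTES;
        memcpy(scratch, s, n);  /* leg 1: main RAM -> WRAM */
        memcpy(d, scratch, n);  /* leg 2: WRAM -> main RAM */
        s += n;
        d += n;
        len -= n;
    }
}
```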

#130384 - Mushu - Sat Jun 02, 2007 7:27 pm

mml wrote:
I think the larger part of the DMA controller's benefit is being able to copy asynchronously (for some value thereof, anyway). But this benefit is lost when using dmaCopy(), since it spins on the control register and only returns once the copy is actually complete. Of course if you're courageous and use dmaCopyAsynch() or its variants, you then have to manage the synchronisation yourself, which in most cases is probably more pain in the arse than it's worth...

Yeah, I thought the same thing at first too, since I figured there'd be four separate DMA controllers, each controlled by one of the registers. In my tests though, this query -

Code:
void query_quadDmaCopy() {
   PROF_START();
   int part_size = 1 << ( mem_size - 2 );
   dmaCopyWordsAsynch( 0, psrc + 0*part_size, pdst + 0*part_size, part_size );
   dmaCopyWordsAsynch( 1, psrc + 1*part_size, pdst + 1*part_size, part_size );
   dmaCopyWordsAsynch( 2, psrc + 2*part_size, pdst + 2*part_size, part_size );
   dmaCopyWordsAsynch( 3, psrc + 3*part_size, pdst + 3*part_size, part_size );
   while ( dmaBusy(0) || dmaBusy(1) || dmaBusy(2) || dmaBusy(3) );
   PROF_END( t_quadDmaCopy );
}


Always had a higher delay than the single DMA transfer. I'm not sure if this is a flaw in my understanding/implementation of the copy, but it raises doubts on the idea that there is a unique DMA controller for each channel. Might just be the same controller juggling all four channels, which kills any benefit you'd normally get from parallel copies. If that's the case though, dunno why they'd have four separate channels in the first place :<

Well. I guess it would be useful if you were loading things in the background (like changing tilesets without a loading screen!) D:

If you want the binary/source with the 4-channel DMA transfer (or just numbers) I can post those too.

simonjhall wrote:
Yo, what are the units in that table?

lol I actually have no idea - I just stole the timer code straight from TONC since I didn't want to have to figure out how to properly set up the timers (right frequency so they don't overflow, cascaded correctly, etc). It doesn't really matter what the units are for comparison purposes, since all of the tests use the same timer setup :3

<3 TONC.

Cearn wrote:
The comparisons aren't quite correct though. dmaCopy() uses the byte-size, not the halfword size, so the figures for DMA are inflated (or, rather, deflated). Also, why use halfword copies instead of word copies?

Whoooops. Changed the dma query function from -

Code:
void query_dmaCopy() {   
   PROF_START();
   dmaCopy( psrc, pdst, 1 << (mem_size-1) );
   PROF_END( t_dmaCopy );
}


to

Code:
void query_dmaCopy() {   
   PROF_START();
   dmaCopyWords( 3, psrc, pdst, 1 << (mem_size) );
   PROF_END( t_dmaCopy );
}


Which ended up adding on some time to the DMA transfer (hopefully I got it right this time). Actually, this makes a lot more sense, since it's basically spot-on with the 4-channel DMA transfer posted above. Lemmie upload updated binaries/source :<

Updated Binary/Source

And, might as well redo the tables~

Code:
Main Memory to Main Memory

          16B  32B 64B 128B  256B  512B 1024B 2048B
memcpy    41   52  73  115   312   731  1219  3906
dmaCopy  102  174 318  365  1182  2334  4638  9246
swiCopy  123  167 221  383   770  1440  2716  6691
4dmaCopy 143  215 359  647  1223  2397  4701  9309

          4096B  8192B 16384B  32768B  65536B
memcpy   11063  23697  48677   98291  196727
dmaCopy  18525  36960  73824  147624  295083
swiCopy  17789  38230  77787  155769  311512
4dmaCopy 18525  36957  73893  147625  295152

Main Memory to VRAM

          16B  32B  64B 128B  256B  512B 1024B 2048B
dmaCopy   53   65   89  137   233   432   830  1641
4dmaCopy 123  139  163  211   307   499   883  1679
memcpy    58   85  139  247   463   939  1891  4001
swiCopy  168  204  360  598  1118  2124  4212  8456

          4096B  8192B 16384B  32768B  65536B
dmaCopy   3256   6519  13022   25973   51950
4dmaCopy  3293   6449  12905   25941   51797
memcpy    8711  18657  38657   77907  155903
swiCopy  17828  37827  76815  153746  307480


:>

Ughhh, okay, wtf.
http://nocash.emubase.de/gbatek.htm#dmatransfers wrote:
The CPU is paused when DMA transfers are active, however, the CPU is operating during the periods when Sound/Blanking DMA transfers are paused.


So to test this, I edited out the while( dmaBusy etc ) loop in the 4-channel DMA query -

Code:
void query_quadDmaCopy() {
   PROF_START();
   int part_size = 1 << ( mem_size - 2 );
   dmaCopyWordsAsynch( 0, psrc + 0*part_size, pdst + 0*part_size, part_size );
   dmaCopyWordsAsynch( 1, psrc + 1*part_size, pdst + 1*part_size, part_size );
   dmaCopyWordsAsynch( 2, psrc + 2*part_size, pdst + 2*part_size, part_size );
   dmaCopyWordsAsynch( 3, psrc + 3*part_size, pdst + 3*part_size, part_size );
   // while ( dmaBusy(0) || dmaBusy(1) || dmaBusy(2) || dmaBusy(3) );
   PROF_END( t_quadDmaCopy );
}


It appears to work exactly the same on the hardware, whether or not that line is there, which suggests the DMA transfer might be blocking (and would also explain why this is approximately the same speed as dmaCopy).

If this is the case, then wtf are there 4 channels for? D:

(moar) auuuughhh wtf. There's actually an anomaly in here - when transferring 16384B from main memory to main memory (4KB on each DMA channel) the copy executes in 8350 units (compared to 49308 units taken by memcpy). I dunno lol :<

#130392 - tepples - Sat Jun 02, 2007 11:26 pm

Mushu wrote:
I'm not sure if this is a flaw in my understanding/implementation of the copy, but it raises doubts on the idea that there is a unique DMA controller for each channel. Might just be the same controller juggling all four channels, which kills any benefit you'd normally get from parallel copies. If that's the case though, dunno why they'd have four separate channels in the first place :<

On the GBA it was
0. Raster effect
1. Secondary raster effect or stereo PCM
2. PCM
3. Immediate copies

On the DS ARM9 it might be
0. Raster effect to main screen
1. Raster effect to sub screen
2. ???
3. Immediate copies

That's what the four channels are for: transfers in modes other than immediate.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.

#130394 - HyperHacker - Sat Jun 02, 2007 11:37 pm

Having 4 channels when only one can run at a time is still good because if one transfer is in progress/pending, you can simply schedule one on the next channel and it'll be done automatically when the previous ones finish. Even if only one runs at a time, you can still do 2 in parallel, one by DMA and one by CPU (or CPU can do other things during transfer).

It's interesting that the CPU is paused during DMA though, which kinda kills that advantage. The Game Boy is similar: during DMA only the upper 128 bytes of RAM are accessible. The CPU keeps going but is basically forced to spin in a timed loop in that small space. Here, the advantage is that DMA is faster than a CPU copy (slow CPU), but it sounds like on NDS, that only applies for main <--> VRAM.

So if I understand correctly (probably not) and this info is all correct, on the DS, DMA is only useful for VRAM access, sound, and HDMA.
_________________
I'm a PSP hacker now, but I still <3 DS.

#130415 - wintermute - Sun Jun 03, 2007 3:32 am

Cearn wrote:
This little bit from gbatek may be of interest:

From http://nocash.emubase.de/gbatek.htm#dsdmatransfers:
Quote:
NDS Sequential Main Memory DMA
Main RAM has different access time for sequential and non-sequential access. Normally DMA uses sequential access (except for the first word), however, if the source and destination addresses are both in Main RAM, then all accesses become non-sequential. In that case it would be faster to use two DMA transfers, one from Main RAM to a scratch buffer in WRAM, and one from WRAM to Main RAM.




Awesome, that makes perfect sense now.


Quote:

Also fun is adding FlushAll() right before the copies. This has quite a large effect on the results (except, of course, for DMA).


Uh, I really don't recommend doing that - DC_FlushRange() & DC_InvalidateRange() exist for a reason.


Quote:

The comparisons aren't quite correct though. dmaCopy() uses the byte-size, not the halfword size, so the figures for DMA are inflated (or, rather, deflated). Also, why use halfword copies instead of word copies?


You've lost me here. You mean memcpy uses byte size (blatantly it doesn't unless source & destination are misaligned)?
_________________
devkitPro - professional toolchains at amateur prices
devkitPro IRC support
Personal Blog

#130420 - DekuTree64 - Sun Jun 03, 2007 6:49 am

Wow, this is really interesting. I just tried it with an 8-register ldmia/stmia copy and it beats memcpy in pretty much all cases. Here is the function:
Code:
fastCopy:
@bail out for misaligned/0 size
tst r2, #31
bxne lr
cmp r2, #32
bxlt lr

stmfd sp!, {r4-r11, lr}

fastCopyLoop:
ldmia r1!, {r4-r11}
stmia r0!, {r4-r11}
subs r2, r2, #32
bgt fastCopyLoop

ldmfd sp!, {r4-r11, pc}
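
[For readers who don't speak ARM asm, the routine above behaves like this C sketch - the real speed comes from the 8-register ldmia/stmia bursts, which plain C won't reproduce:]

```c
#include <stdint.h>
#include <stddef.h>

/* C equivalent of fastCopy: bail out unless the byte count is a
   non-zero multiple of 32, then move eight 32-bit words per
   iteration, mirroring the ldmia/stmia pair. */
static void fastCopyC(uint32_t *dst, const uint32_t *src, size_t bytes) {
    if (bytes < 32 || (bytes & 31))
        return;
    do {
        dst[0] = src[0]; dst[1] = src[1]; dst[2] = src[2]; dst[3] = src[3];
        dst[4] = src[4]; dst[5] = src[5]; dst[6] = src[6]; dst[7] = src[7];
        dst += 8;
        src += 8;
        bytes -= 32;
    } while (bytes);
}
```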

I also added in the DC_FlushAll before each profile, so the small sizes aren't all just cache-to-cache. But what really surprised me is that if you align the source and dest to land on 32-byte boundaries (i.e. cache lines), it speeds it up by quite a lot.

Here are the results. fastCopy is my function, and fcAlign has source/dest aligned to 32-bytes.
Code:
Main Memory to Main Memory

          16B  32B 64B 128B  256B  512B 1024B 2048B
memcpy    157  217 313  505   889  1657  3193  6265
fastCopy  n/a  132 196  304   520   972  1816  3564
fcAlign   n/a  151 198  292   480   856  1608  3112
dmaCopy4  169  241 385  673  1249  2401  4705  9313

          4096B  8192B 16384B  32768B  65536B
memcpy    12409  24697  49273   98425  196729
fastCopy   7020  13912  27756   55384  110779
fcAlign    6120  12136  24168   48232  96439
dmaCopy4  18592  36961  73825  147625  295094

Main Memory to VRAM

          16B  32B  64B 128B  256B  512B 1024B 2048B
memcpy    163  212  288  440   744  1352  2568  5000
fastCopy  n/a  142  208  300   484   852  1588  3060
fcAlign   n/a  158  202  290   466   818  1522  2930
dmaCopy4  120  132  156  204   300   499   897  1698

          4096B  8192B 16384B  32768B  65536B
memcpy    9864   19592  39048   77960  155784
fastCopy  6004   11892  23648   47220   94322
fcAlign   5746   11378  22642   45170   90226
dmaCopy4  3312    6553  13035   25986   51974

_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku

#130459 - Noda - Sun Jun 03, 2007 8:05 pm

Why not add an ASM copy function like yours to libnds? Could be useful...

#134024 - HyperHacker - Tue Jul 10, 2007 3:53 am

Pardon the bump, but I don't think it's true that only one DMA channel can run at a time. I just timed loading 256x192 of a 512x512 bitmap from VRAM to main RAM. If I only use one DMA channel it takes 145ms, while if I use all 4 and spin on the last one, it finishes in a mere 32ms - about 22% as long, which is actually more than 4 times as fast.

Specifically, what I'm doing is calling dmaCopyWordsAsynch() for channels 0, 1 and 2 and dmaCopyWords() for channel 3. Each is copying 512 bytes at a time; 98304 bytes are copied in total.

What is worth noting, though, is that having all four DMA transfers be asynchronous and doing a CPU copy while they run didn't improve performance at all.
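
[Concretely, the transfer described is 192 rows of 512 bytes, each read from a 1024-byte-stride bitmap, with the channel rotating every row. A host sketch, with memcpy standing in for dmaCopyWordsAsynch()/dmaCopyWords() and the layout constants inferred from the post:]

```c
#include <string.h>
#include <stdint.h>

/* 256x192 window of a 512x512 16-bit bitmap: 512 bytes per row,
   source rows 1024 bytes apart, 192 * 512 = 98304 bytes in total. */
enum { ROWS = 192, ROW_BYTES = 512, SRC_STRIDE = 1024 };

/* One 512-byte copy per row, rotating over the four DMA channels
   (0-2 asynchronous, 3 blocking in the original post). memcpy
   stands in for the per-channel DMA call in this host sketch. */
static void copyRowsInterleaved(uint8_t *dst, const uint8_t *src) {
    for (int row = 0; row < ROWS; row++) {
        int channel = row & 3;
        (void)channel;  /* on hardware: issue on DMA channel `channel` */
        memcpy(dst + (size_t)row * ROW_BYTES,
               src + (size_t)row * SRC_STRIDE, ROW_BYTES);
    }
}
```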
_________________
I'm a PSP hacker now, but I still <3 DS.

#134031 - olimar - Tue Jul 10, 2007 5:59 am



Last edited by olimar on Wed Aug 20, 2008 10:47 pm; edited 1 time in total

#134038 - Ant6n - Tue Jul 10, 2007 7:13 am

olimar wrote:
...
If the CPU doesn't touch main ram (i.e. TCMs or cache), it's not stopped by DMA.

I thought as long as the CPU doesn't touch the bus everything is fine. Since TCM/cache live on the CPU, they don't stall. But then there'd be many more ways to stall the CPU, i.e. accessing video RAM, or shared RAM, or memory-mapped I/O, no?

#134057 - olimar - Tue Jul 10, 2007 11:46 am



Last edited by olimar on Wed Aug 20, 2008 10:47 pm; edited 1 time in total

#134060 - tepples - Tue Jul 10, 2007 11:51 am

To be specific, there are 3 kinds of "ROM" that need testing:
  • GBA ROM (which is not present if you use a SLOT-1 card)
  • DS Game Card I/O
  • DS BIOS

_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.