gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback machine copies. A new forum can be found here.

DS development > swiCopy problem?

#148839 - JanoSicek - Fri Jan 11, 2008 1:40 am

I just ran into a problem using swiCopy to copy a big block of tiles from main RAM to VRAM.

I wanted to copy 64kB, that is, 16384 words.

However the copying always stopped at 48kB boundary, which is exactly between two VRAM banks.

Memcpy was glitchy too.

I ended up copying it in small 64-byte blocks....

Am I doing something wrong or is this behavior known?

#148856 - simonjhall - Fri Jan 11, 2008 11:06 am

When the banks are mapped into memory are they 'next' to each other in the address space? Is there a gap between the two blocks? Cos if there is then there's no way a single copy call is going to work :-D

Anyway the reason you're getting problems with memcpy is probably because it's copying the data one byte at a time yet you can only write data to that memory in 16- or 32-bit quantities. Just use a for loop that moves data in short quantities. Avoid DMA copies until you've gotten it working with the boring C copy and are aware of the implications of using DMA.
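A minimal sketch of the "boring C copy" suggested above, assuming both pointers are halfword-aligned. `copy_halfwords` is a hypothetical helper name, not a libnds function, and the `volatile` qualifier is only there to keep the compiler from merging or splitting the 16-bit stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy `count` halfwords using only 16-bit stores.
 * Hypothetical helper: the name and signature are illustrative. */
static void copy_halfwords(volatile uint16_t *dst, const uint16_t *src, size_t count)
{
    /* Both pointers must be 2-byte aligned for halfword access. */
    assert(((uintptr_t)dst & 1u) == 0);
    assert(((uintptr_t)src & 1u) == 0);

    for (size_t i = 0; i < count; i++) {
        dst[i] = src[i];
    }
}
```

Note that `count` is in halfwords, so a 64-byte tile would be `copy_halfwords(dst, src, 32)`.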

Peace.
_________________
Big thanks to everyone who donated for Quake2

#148859 - nipil - Fri Jan 11, 2008 1:59 pm

simonjhall wrote:
Anyway the reason you're getting problems with memcpy is probably because it's copying the data one byte at a time yet you can only write data to that memory in 16- or 32-bit quantities. Just use a for loop that moves data in short quantities. Avoid DMA copies until you've gotten it working with the boring C copy and are aware of the implications of using DMA.


Just a remark: the fact that memcpy takes its length in bytes doesn't mean that it actually copies data one byte at a time. It's perfectly doable to copy most of it via 4-byte accesses, and the remainder using either 2+1 or 3x1 as needed. AFAIK I never had any problem with memcpy, whatever the size, alignment and whether it be VRAM or not ;)

Second remark about memcpy/for-loop/dma/swiFastCopy : the update at the end of this wiki entry just shows memcpy is the fastest. And for-loop the slowest... Besides, you can't implement a memcpy the wrong way ;)

#148860 - JanoSicek - Fri Jan 11, 2008 2:50 pm

How does swiCopy work? Is it the same as DMA copy?
I know about some problems with DMA copy, so I try to avoid it unless necessary. In my tests swiCopy was the fastest, faster than memcpy...

The VRAM banks are mapped next to each other:
VRAM A and VRAM D, where D is mapped at 0x06020000

#148874 - eKid - Fri Jan 11, 2008 5:03 pm

JanoSicek wrote:
How does swiCopy work? Is it the same as DMA copy?

swiCopy is a function to call the 'CpuSet' function in the BIOS. It doesn't use DMA.

JanoSicek wrote:
However the copying always stopped at 48kB boundary, which is exactly between two VRAM banks.

How are you using swiCopy? It should be able to copy way more than 48kB (gbatek says 20-bit word count).
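As a rough illustration of how the third argument to the CpuSet call is interpreted, here is a sketch that computes the number of bytes a given control word asks for. The constants are assumptions: they are meant to mirror libnds's COPY_MODE_HWORD / COPY_MODE_WORD flags and the CpuSet layout as I read it in gbatek (count in the low bits, bit 26 selecting 32-bit units). The count field is taken here as bits 0-20, while the post above quotes 20 bits. `cpuset_bytes` is a hypothetical helper, not part of any library.

```c
#include <stdint.h>

/* Assumed values, intended to mirror libnds's COPY_MODE_* flags. */
#define CPUSET_MODE_HWORD  0u                   /* 16-bit units (the default) */
#define CPUSET_MODE_WORD   (1u << 26)           /* 32-bit units */
#define CPUSET_COUNT_MASK  ((1u << 21) - 1u)    /* assumed width of the count field */

/* Number of bytes a CpuSet call with this control word would transfer. */
static uint32_t cpuset_bytes(uint32_t control)
{
    uint32_t units = control & CPUSET_COUNT_MASK;
    uint32_t unit_size = (control & CPUSET_MODE_WORD) ? 4u : 2u;
    return units * unit_size;
}
```

Under these assumptions a bare count of 16384 requests 16384 halfwords (32 kB), while 16384 | COPY_MODE_WORD or a plain 32768 would request 64 kB. Whether that has anything to do with the symptom reported in this thread is not established here.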

#148877 - JanoSicek - Fri Jan 11, 2008 5:10 pm

eKid wrote:
JanoSicek wrote:
How does swiCopy work? Is it the same as DMA copy?

swiCopy is a function to call the 'CpuSet' function in the BIOS. It doesn't use DMA.

JanoSicek wrote:
However the copying always stopped at 48kB boundary, which is exactly between two VRAM banks.

How are you using swiCopy? It should be able to copy way more than 48kB (gbatek says 20-bit word count).


bg->vramoffset=(u16*)(BG_TILE_RAM(tilebase));
swiCopy((void*)ptiles,(void*)bg->vramoffset,16384); // I tried 32768 and 65536 too; neither worked

Like this.
When I copy to tilebase 5, which should cover bases 5, 6, 7 and 8,
the tiles in base 8 are blank.

Later, when I copy tiles there one by one, they work, so base 8 is mapped.


vramSetBankA(VRAM_A_MAIN_BG);
vramSetBankB(VRAM_B_MAIN_SPRITE_0x06420000);
vramSetBankC(VRAM_C_SUB_BG);
vramSetBankD(VRAM_D_MAIN_BG_0x06020000);
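The "48kB boundary" can be checked with a little address arithmetic. This sketch assumes that a tile base is 16 kB and that main background VRAM starts at 0x06000000 (which is how I understand libnds's BG_TILE_RAM macro); the bank D address comes from the VRAM_D_MAIN_BG_0x06020000 setting above. The helper names are hypothetical.

```c
#include <stdint.h>

#define VRAM_MAIN_BG_BASE  0x06000000u  /* assumed start of main BG VRAM */
#define TILE_BASE_SIZE     0x4000u      /* assumed: 16 kB per tile base */
#define VRAM_D_MAPPED_AT   0x06020000u  /* from VRAM_D_MAIN_BG_0x06020000 */

/* Address at which a given tile base starts. */
static uint32_t tile_base_addr(uint32_t base)
{
    return VRAM_MAIN_BG_BASE + base * TILE_BASE_SIZE;
}

/* Bytes from the start of `base` up to the address where bank D begins. */
static uint32_t bytes_until_bank_d(uint32_t base)
{
    return VRAM_D_MAPPED_AT - tile_base_addr(base);
}
```

With these values tile base 5 starts at 0x06014000 and tile base 8 at 0x06020000, so a copy that starts at base 5 reaches the start of bank D after exactly 48 kB.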

#148953 - simonjhall - Sat Jan 12, 2008 2:02 pm

nipil wrote:
Just a remark: the fact that memcpy takes its length in bytes doesn't mean that it actually copies data one byte at a time. It's perfectly doable to copy most of it via 4-byte accesses, and the remainder using either 2+1 or 3x1 as needed.

That only works if the alignments match. Assuming the 'bulk' of the copy is done with 32-bit writes (seems likely) then it will only be able to use this fast path if both addresses have the same 32-bit alignment. If not, a byte copy normally gets used instead.

eg,
Code:
memcpy((void* )0x200000, (void *)0x300000, 12345)
will do the bulk with 32-bit writes but
Code:
memcpy((void* )0x200001, (void *)0x300000, 12345)
will normally be done completely with byte writes in the absence of any byte-rearranging instructions.
finally
Code:
memcpy((void* )0x200001, (void *)0x300001, 12345)
will normally be done with the fast path, since both addresses have the same alignment.

#148964 - Cearn - Sat Jan 12, 2008 6:04 pm

simonjhall wrote:

finally
Code:
memcpy((void* )0x200001, (void *)0x300001, 12345)
will normally be done with the fast path, since both addresses have the same alignment.

This last part unfortunately isn't true. At least not with the memcpy() found in devkitARM. DKA's memcpy() will use the faster way under two conditions:
  • If the size is greater than or equal to 16
  • If both the source and destination are word-aligned. The actual test used is (((u32)src | (u32)dest) & 3)==0.
Byte-copies are used in all other cases.
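The two conditions above can be written as a single predicate. This is only a sketch of the rule as described in this post, not the devkitARM source; `memcpy_uses_word_path` is a hypothetical name.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when the rule described above would pick the word-copy path:
 * at least 16 bytes, and both addresses word-aligned. */
static bool memcpy_uses_word_path(uintptr_t dst, uintptr_t src, size_t len)
{
    return len >= 16 && ((src | dst) & 3u) == 0;
}
```

Applied to the three examples from earlier in the thread, only the first one (0x200000 and 0x300000) passes. The last one (0x200001 and 0x300001) fails even though the two alignments match.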

#148977 - simonjhall - Sun Jan 13, 2008 12:12 am

Wot I mean is - *normally* if the addresses are not word-aligned, yet the source and destination alignments match (like in that final example), the first three bytes would be handled with byte writes, then the rest of it (apart from the end) would be done with word writes. The end bit would then be done with byte writes.
However, I've not looked at the DKA version of memcpy, but this is what every version I have the source to does. If the DS memcpy doesn't support this extra path then maybe it should?
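The head / bulk / tail split described here can be sketched as a planning function that only counts accesses and never touches memory. It is an illustration of the general scheme, not any particular libc's implementation; `struct copy_plan` and `plan_copy` are hypothetical names, and when the two alignments differ the whole length is simply reported as byte copies.

```c
#include <stddef.h>
#include <stdint.h>

struct copy_plan {
    size_t head_bytes;  /* byte copies before the word loop */
    size_t words;       /* 32-bit copies in the bulk */
    size_t tail_bytes;  /* byte copies after the word loop */
};

static struct copy_plan plan_copy(uintptr_t dst, uintptr_t src, size_t len)
{
    struct copy_plan p = {0, 0, 0};

    if (((dst ^ src) & 3u) != 0) {
        /* Alignments differ: no word path without byte shuffling. */
        p.head_bytes = len;
        return p;
    }

    /* Bytes needed to bring both pointers up to a word boundary. */
    size_t head = (size_t)((4u - (dst & 3u)) & 3u);
    if (head > len) {
        head = len;
    }

    p.head_bytes = head;
    p.words = (len - head) / 4;
    p.tail_bytes = (len - head) % 4;
    return p;
}
```

For the final example (0x200001, 0x300001, 12345 bytes) this gives 3 head bytes, 3085 words and 2 tail bytes.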

Either way, I wouldn't use memcpy on 16/32-bit only memory unless you're sure of what's going on.

#148996 - nipil - Sun Jan 13, 2008 11:20 am

Another point with memcpy: even if you have good byte alignment, when you're accessing video memory, memcpy behaviour depends on the number of bytes written.

- under 16 bytes, "nothing" is written (ie byte access on video ram = "nothing")
- 16 <= len < 20 : 16 pixels are written. remainder 1 to 3 pixels are not set.
- goes on with 4-pixels steps...
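The list above fits a simple model, sketched here under three assumptions: source and destination are word-aligned, byte writes to VRAM have no visible effect (as the first bullet says), and memcpy follows the rule Cearn gave earlier in the thread (word copies only from 16 bytes up, byte copies for the remainder). `vram_bytes_written` is a hypothetical name.

```c
#include <stddef.h>

/* Bytes that would actually show up in VRAM for a memcpy of `len` bytes,
 * under the assumptions stated above. */
static size_t vram_bytes_written(size_t len)
{
    if (len < 16) {
        return 0;               /* byte copies only: nothing lands */
    }
    return len & ~(size_t)3;    /* whole words land, the 1-3 byte tail does not */
}
```

This reproduces the 4-pixel steps: 16 to 19 bytes give 16 pixels, 20 to 23 bytes give 20, and so on.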

While the "step" effect is normal (16 or 32 bit VRAM accesses) the "less than 16" effect is actually surprising. Furthermore, I had thought the step size would have been 2 pixels, as I thought memcpy used 32-bit accesses for the main loop, then a 16-bit and an 8-bit access to fill the remainder. A capture taken from no$gba showed this, and the same behaviour was seen on a real DS.

[Image: no$gba capture of the test output; not preserved in this archive]

Here's the test code used to show what i mean :

Code:
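#include <nds.h>
#include <cstring>

int main(void) {
   // Mode 5 with BG3 as an 8-bit 256x256 bitmap stored in VRAM bank A
   powerON(POWER_ALL_2D);
   videoSetMode(MODE_5_2D | DISPLAY_BG3_ACTIVE);
   vramSetBankA(VRAM_A_MAIN_BG_0x06000000);

   // Palette index 0 = white (background), index 1 = black (copied pixels)
   BG_PALETTE[0] = RGB15(31,31,31);
   BG_PALETTE[1] = RGB15(0,0,0);

   BG3_CR = BG_BMP8_256x256 | BG_BMP_BASE(0) | BG_PRIORITY(2);

   // Identity affine transform, no scrolling
   BG3_XDX = 1 << 8;
   BG3_XDY = 0;
   BG3_YDX = 0;
   BG3_YDY = 1 << 8;
   BG3_CX = 0;
   BG3_CY = 0;

   // Clear the visible part of the bitmap to palette index 0
   uint16* screen = (uint16*) BG_BMP_RAM(0);
   memset(screen, 0, SCREEN_WIDTH*SCREEN_HEIGHT);

   // 256 source pixels, all palette index 1
   uint8 data[256];
   memset (data, 1, 256);

   // Scanline i receives an i-byte memcpy, so each row shows how many bytes landed
   for (int i=0; i < 192; i++) {
      memcpy(screen, data, i);
      screen += 128; // 256px = 128 uint16
   }

   return 0;
}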


Anyway, my advice would then be: use memcpy on VRAM only when you are sure it's aligned (i.e. both src and dst base addresses are multiples of 4, and the length is a multiple of 4 and at least 16, given the "less than 16" effect above)

#149111 - JanoSicek - Tue Jan 15, 2008 2:53 pm

I did some testing on loading individual tiles (64 bytes)
When I was loading a full screen of tiles, this took:
856 microseconds for swiCopy (COPY_MODE_WORD and COPY_MODE_HWORD give similar results)
3174 microseconds for memcpy
1700 microseconds for DMAcopy (WORDS/HALFWORDS, asynch/synch)

I am quite curious about the wiki remark about memcpy being the fastest.
For now, I'm using SwiCopy.

#149114 - knight0fdragon - Tue Jan 15, 2008 4:19 pm

Mushu wrote:
Sorry to double post, but I'm a huge attention whore, and I wanted to get some attention :>

Anyway, I've never really been satisfied with the benchmarks done on memcpy/swiCopy/dmaCopy, simply because they all seem to show that memcpy is faster than dmaCopy. dmaCopy is a memory transfer with a dedicated controller, so why the hell would it be slower than a software memcpy?

Additionally, all of the benchmarks I've ever seen only gave a single number for each of the functions, rather than testing over a variety of transfer sizes and locations (main memory vs. VRAM vs. shared memory, etc). So I decided to hunker down and do some testing of my own, and the results were... informative.

I wrote a small app (source+binary to follow) to test all power-of-two block sizes from 16-1048576 bytes, using either main memory or VRAM as src/dst, and main memory for dst/src (I didn't want to end up with the situation where the source and destination are the same address, so only 1 of them is allowed to be VRAM).

Additionally, I mapped all the VRAM to 0x6000000, assuming that it will perform the same regardless of the mapping (which makes sense to me, but may not be the case).

Oh god I have to make a table. These values were obtained with the binary (to follow) on my DS Lite; there's probably at least 5 bugs in the source, so I wouldn't call them very conclusive, but they definitely do suggest conclusions -

Code:

Main Memory to Main Memory

          16B  32B 64B 128B  256B  512B 1024B 2048B
memcpy    41   52  73  115   312   731  1219  3906
dmaCopy   94  158 286  542  1054  2118  4148  8244
swiCopy  123  167 221  383   770  1440  2716  6691

          4096B  8192B 16384B  32768B  65536B
memcpy   11063  23697  48677   98291  196727
dmaCopy  16481  32887  65655  131263  262338
swiCopy  17789  38230  77787  155769  311512


etc, you get the picture. This matches what I've read so far, memcpy > dmaCopy > swiCopy, which is expected. When you switch over to VRAM, however, a different picture emerges -

Code:

Main Memory to VRAM

          16B  32B  64B 128B  256B  512B 1024B 2048B
dmaCopy   48   56   72  104   168   324   592  1145
memcpy    58   85  139  247   463   939  1891  4001
swiCopy  168  204  360  598  1118  2124  4212  8456

          4096B  8192B 16384B  32768B  65536B
dmaCopy   2226   4375   8627   17125   34121
memcpy    8711  18657  38657   77907  155903
swiCopy  17828  37827  76815  153746  307480


Anyway, so it appears that dmaCopy is significantly faster for transferring any size of data to VRAM, while memcpy is faster when working only in main memory. I would suspect that the dma hardware doesn't actually kick in unless you're transferring to VRAM, but I have no foundation for that conclusion.
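To put a number on "significantly faster", the Main Memory to VRAM figures from the table above can be turned into memcpy/dmaCopy ratios. The arrays below are copied from that table; the unit of the timings is not stated in the post, which does not matter for a ratio. The array and function names are hypothetical.

```c
#include <stddef.h>

#define NUM_SIZES 13

/* Block sizes and timings copied from the "Main Memory to VRAM" table. */
static const unsigned block_bytes[NUM_SIZES] = {
    16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536
};
static const unsigned dma_time[NUM_SIZES] = {
    48, 56, 72, 104, 168, 324, 592, 1145, 2226, 4375, 8627, 17125, 34121
};
static const unsigned memcpy_time[NUM_SIZES] = {
    58, 85, 139, 247, 463, 939, 1891, 4001, 8711, 18657, 38657, 77907, 155903
};

/* Block size in bytes for table column `i`. */
static unsigned block_size(size_t i)
{
    return block_bytes[i];
}

/* How many times longer memcpy took than dmaCopy for column `i`. */
static double memcpy_over_dma(size_t i)
{
    return (double)memcpy_time[i] / (double)dma_time[i];
}
```

With these figures the ratio grows from about 1.2 at 16 bytes to about 4.6 at 64 kB, so the advantage of dmaCopy in this benchmark increases with the transfer size.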

I'm sure a lot of people already knew about this (or I've made a mistake somewhere ._.) but I always seem to have read that memcpy is faster than everything else etc, and just thought that was odd.

Blarg.

Sauce + Binary
Just the Binary

:<

Oh, and controls for the thing -
* A: Double the amount of memory allocated.
* B: Halve the amount of memory allocated.
* X: Toggle destination (VRAM/Main Memory)
* Y: Toggle source (VRAM/Main Memory)


this is from http://forum.gbadev.org/viewtopic.php?t=13242


source is there if you want to add more tests to it
_________________
http://www.myspace.com/knight0fdragonds

MK DS FC: Dragon 330772 075464
AC WW FC: Anthony SamsClub 1933-3433-9458
MPFH: Dragon 0215 4231 1206

#149130 - nipil - Tue Jan 15, 2008 8:21 pm

Very informative. I too had to rely on speed gossips. Thanks for the link.

#149139 - JanoSicek - Tue Jan 15, 2008 10:39 pm

Nice benchmarks, however in my case the swiCopy is the fastest solution!
Why is this?
The source data lies in a const u8 [100000] array holding the bitmap; in which part of memory does this lie? The target is VRAM.

EDIT: I just checked and the source is in 0x2........ which is main RAM, so the dmaCopy should be fastest. However it is not :(

#149141 - JanoSicek - Tue Jan 15, 2008 11:24 pm

DOH!

My problem was the emulator!

Copying RAM->VRAM
Emulator NO$GBA 64 bytes
Swi (22)
dma (83)
memcpy (156)
Emulator NO$GBA 64kbytes
Swi (22) !!
dma (32819)
memcpy (120870)

Don't trust the emulator :) Lesson learned for me :)
On hardware, DMA copy indeed rules them all, with the async option being the fastest.

#149154 - tepples - Wed Jan 16, 2008 1:29 am

To make timing of SWI calls in NO$GBA somewhat more accurate, you can dump a BIOS from your DS and use that.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.

#149158 - kusma - Wed Jan 16, 2008 2:07 am

tepples: How would one go about doing that? I found a program that dumped to the save-ram for GBA, but I haven't seen any solutions for NDS.

#149161 - tepples - Wed Jan 16, 2008 2:33 am

The only DS BIOS dumper I could find that's not on one of the Three Forbidden Sites is this one. SLOT-2 only, not GBAMP. If it came with source code, I would have ported it to libfat+DLDI like I did with the GBA BIOS dumper, but it doesn't.

#149172 - OSW - Wed Jan 16, 2008 9:43 am

maybe this is helpful?
http://nds.cmamod.com/2007/01/24/dsbf_dump-79-bios-firmware-dumper/

#149174 - simonjhall - Wed Jan 16, 2008 10:24 am

I was just reading the source in there to dump the ARM7 ROM - yikes! How did someone figure out that those two instructions were there?!
Some people are just far too good!