gbadev.org forum archive

I just ran into a problem using swiCopy to copy a big block of tiles from main RAM to VRAM.

I wanted to copy 64 kB, that is, 16384 words.

However, the copying always stopped at the 48 kB boundary, which is exactly between two VRAM banks.

memcpy was glitchy too.

I ended up copying it in small 64-byte blocks...

Am I doing something wrong or is this behavior known?

When the banks are mapped into memory are they 'next' to each other in the address space? Is there a gap between the two blocks? Cos if there is then there's no way a single copy call is going to work :-D

Anyway the reason you're getting problems with memcpy is probably because it's copying the data one byte at a time yet you can only write data to that memory in 16- or 32-bit quantities. Just use a for loop that moves data in short quantities. Avoid DMA copies until you've gotten it working with the boring C copy and are aware of the implications of using DMA.
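A minimal sketch of the "boring C copy" suggested above, assuming both buffers are halfword-aligned. The function name `copy_halfwords` is made up for this illustration; it is not a libnds call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy `count` halfwords using only 16-bit stores.
 * `volatile` on the destination asks the compiler to emit one store per
 * element instead of replacing the loop with a call to memcpy. */
void copy_halfwords(volatile uint16_t *dst, const uint16_t *src, size_t count)
{
    /* Halfword accesses need 2-byte aligned pointers. */
    assert(((uintptr_t)dst & 1) == 0 && ((uintptr_t)src & 1) == 0);

    for (size_t i = 0; i < count; i++) {
        dst[i] = src[i];
    }
}
```

Note that the count is in halfwords, so a 64 kB block would be 32768 iterations.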

Peace.
_________________
Big thanks to everyone who donated for Quake2

simonjhall wrote:

Anyway the reason you're getting problems with memcpy is probably because it's copying the data one byte at a time yet you can only write data to that memory in 16- or 32-bit quantities. Just use a for loop that moves data in short quantities. Avoid DMA copies until you've gotten it working with the boring C copy and are aware of the implications of using DMA.

Just a remark: the fact that memcpy takes its length in bytes doesn't mean it actually copies one byte at a time. It's perfectly doable to copy most of the data with 4-byte accesses and the remainder with 2+1 or 3x1 byte accesses as needed. AFAIK I've never had any problem with memcpy, whatever the size or alignment, and whether the target was VRAM or not ;)

Second remark, about memcpy/for-loop/DMA/swiFastCopy: the update at the end of this wiki entry shows memcpy as the fastest and the for loop as the slowest... Besides, you can't implement a memcpy the wrong way ;)

How does swiCopy work? Is it the same as DMA copy?
I know about some problems with DMA copy, so I try to avoid it unless necessary. In my tests swiCopy was the fastest, faster than memcpy...

The VRAM banks are mapped next to each other:
VRAM A and VRAM D, where D is mapped at 0x06020000

JanoSicek wrote:

How does swiCopy work? Is it the same as DMA copy?

swiCopy is a function to call the 'CpuSet' function in the BIOS. It doesn't use DMA.

JanoSicek wrote:

However the copying always stopped at 48kB boundary, which is exactly between two VRAM banks.

How are you using swiCopy? It should be able to copy way more than 48kB (gbatek says 20-bit word count).
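For reference, a sketch of how the count and the transfer mode share CpuSet's third argument, as I read the GBATEK description: bits 0-20 hold the unit count and bit 26 selects 32-bit units. The macro and function names below are local to this sketch (libnds has its own macros), so treat them as assumptions. The practical point is that a bare count with no mode bit set is counted in halfwords.

```c
#include <assert.h>
#include <stdint.h>

/* Bit layout of the CpuSet length/mode word (names are hypothetical). */
#define CPUSET_COUNT_MASK 0x1FFFFFu   /* bits 0-20: number of units     */
#define CPUSET_FILL       (1u << 24)  /* bit 24: fixed source (fill)    */
#define CPUSET_32BIT      (1u << 26)  /* bit 26: 0 = 16-bit, 1 = 32-bit */

/* Number of bytes a given control word asks the BIOS to transfer. */
uint32_t cpuset_bytes(uint32_t control)
{
    uint32_t units = control & CPUSET_COUNT_MASK;
    uint32_t unit_size = (control & CPUSET_32BIT) ? 4u : 2u;
    return units * unit_size;
}

/* Build a control word that copies `bytes` bytes as 32-bit words. */
uint32_t cpuset_copy_words(uint32_t bytes)
{
    assert(bytes % 4u == 0u && bytes / 4u <= CPUSET_COUNT_MASK);
    return (bytes / 4u) | CPUSET_32BIT;
}

/* A bare 16384 is a halfword count, i.e. 32 kB rather than 64 kB. */
void check_counts(void)
{
    assert(cpuset_bytes(16384u) == 32u * 1024u);
    assert(cpuset_bytes(cpuset_copy_words(64u * 1024u)) == 64u * 1024u);
}
```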

eKid wrote:

JanoSicek wrote:

How does swiCopy work? Is it the same as DMA copy?

swiCopy is a function to call the 'CpuSet' function in the BIOS. It doesn't use DMA.

JanoSicek wrote:

However the copying always stopped at 48kB boundary, which is exactly between two VRAM banks.

How are you using swiCopy? It should be able to copy way more than 48kB (gbatek says 20-bit word count).

bg->vramoffset=(u16*)(BG_TILE_RAM(tilebase));
swiCopy((void*)ptiles,(void*)bg->vramoffset,16384); // I also tried 32768 and 65536; neither worked

Like this.
When I copy to tile base 5, which should cover bases 5, 6, 7 and 8,
the tiles in base 8 are blank.

Later, when I copy tiles there one by one, they work, so base 8 is mapped.

vramSetBankA(VRAM_A_MAIN_BG);
vramSetBankB(VRAM_B_MAIN_SPRITE_0x06420000);
vramSetBankC(VRAM_C_SUB_BG);
vramSetBankD(VRAM_D_MAIN_BG_0x06020000);
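The address arithmetic behind "48 kB is exactly between two banks" can be sketched as follows, assuming 16 kB per tile base starting at 0x06000000 and bank D starting at 0x06020000 as in the bank setup above. The helper names are invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MAIN_BG_BASE   0x06000000u  /* start of main-engine BG VRAM    */
#define TILE_BASE_SIZE 0x4000u      /* 16 kB per tile base             */
#define BANK_D_START   0x06020000u  /* from VRAM_D_MAIN_BG_0x06020000  */

/* Address of a tile base, mirroring what BG_TILE_RAM(base) computes. */
uint32_t tile_base_addr(uint32_t base)
{
    assert(base < 16u);  /* the tile base field is 4 bits wide on the DS */
    return MAIN_BG_BASE + base * TILE_BASE_SIZE;
}

/* Bytes of a copy starting at `start` that land before bank D begins. */
uint32_t bytes_before_bank_d(uint32_t start, uint32_t len)
{
    if (start >= BANK_D_START) {
        return 0u;
    }
    uint32_t room = BANK_D_START - start;
    return len < room ? len : room;
}
```

With these assumptions, tile base 5 sits at 0x06014000, so a 64 kB copy from there places its first 48 kB (bases 5, 6 and 7) in bank A and the last 16 kB (base 8) in bank D.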

nipil wrote:

Just a remark: the fact that memcpy takes its length in bytes doesn't mean it actually copies one byte at a time. It's perfectly doable to copy most of the data with 4-byte accesses and the remainder with 2+1 or 3x1 byte accesses as needed.

That only works if the alignments match. Assuming the 'bulk' of the copy is done with 32-bit writes (seems likely), it can only use this fast path if both addresses have the same 32-bit alignment. If not, a byte copy normally gets used instead.

e.g.,

Code:

memcpy((void *)0x200000, (void *)0x300000, 12345);

will do the bulk with 32-bit writes, but

Code:

memcpy((void *)0x200001, (void *)0x300000, 12345);

will normally be done completely with byte writes in the absence of any byte-rearranging instructions.
Finally,

Code:

memcpy((void *)0x200001, (void *)0x300001, 12345);

will normally be done with the fast path, since both addresses have the same alignment.

simonjhall wrote:

finally

Code:

memcpy((void* )0x200001, (void *)0x300001, 12345)

will normally be done with the fast path, since both addresses have the same alignment.

This last part unfortunately isn't true, at least not with the memcpy() found in devkitARM. DKA's memcpy() will use the faster way only when both of these conditions hold:

- The size is greater than or equal to 16.
- Both the source and destination are word-aligned. The actual test used is (((u32)src | (u32)dest) & 3)==0.

Byte copies are used in all other cases.
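The two conditions above can be written as a small predicate. This is only a restatement of the rule as described in this post, not code taken from devkitARM, and the function names are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when the word-copy path would be taken under the rule above:
 * at least 16 bytes, and both addresses are multiples of 4. */
bool dka_memcpy_uses_words(uintptr_t dst, uintptr_t src, size_t len)
{
    return len >= 16 && ((src | dst) & 3) == 0;
}

/* The three calls from the earlier post, checked against that rule. */
void check_thread_examples(void)
{
    assert(dka_memcpy_uses_words(0x200000, 0x300000, 12345));   /* both aligned  */
    assert(!dka_memcpy_uses_words(0x200001, 0x300000, 12345));  /* dst unaligned */
    assert(!dka_memcpy_uses_words(0x200001, 0x300001, 12345));  /* same offset, still unaligned */
}
```

Under this rule the third call falls back to byte copies even though both addresses share the same offset within a word.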

Wot I mean is: *normally*, if the addresses are not word-aligned yet the source and destination alignments match (like in that final example), the first three bytes would be handled with byte writes, then the rest of it (apart from the end) would be done with word writes. The end bit would then be done with byte writes.
However, I've not looked at the DKA version of memcpy; this is just what every version I have the source to does. If the DS memcpy doesn't support this extra path then maybe it should?
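A sketch of the head/body/tail strategy described here, written as portable C for illustration. It is not the source of any particular libc, and `copy_head_body_tail` is a made-up name. The 4-byte memcpy calls stand in for single 32-bit loads and stores, which is the aliasing-safe way to express them in standard C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

void *copy_head_body_tail(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);

    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Word copies are only possible when both pointers share the same
     * offset within a 32-bit word. Otherwise everything goes byte-wise. */
    if ((((uintptr_t)d ^ (uintptr_t)s) & 3) == 0) {
        /* Head: byte copies until the pointers are word aligned. */
        while (len > 0 && ((uintptr_t)d & 3) != 0) {
            *d++ = *s++;
            len--;
        }
        /* Body: one 32-bit word at a time. */
        while (len >= 4) {
            uint32_t w;
            memcpy(&w, s, 4);
            memcpy(d, &w, 4);
            d += 4;
            s += 4;
            len -= 4;
        }
    }

    /* Tail, or the whole buffer when the alignments differ: bytes. */
    while (len > 0) {
        *d++ = *s++;
        len--;
    }
    return dst;
}
```

Even with this extra path, the head and tail are still byte writes, so on memory that only accepts 16- or 32-bit writes those bytes would not land.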

Either way, I wouldn't use memcpy on 16/32-bit only memory unless you're sure of what's going on.

Another point about memcpy: even with good byte alignment, when you're accessing video memory, memcpy's behaviour depends on the number of bytes written.

- under 16 bytes, "nothing" is written (i.e. byte access on video RAM = "nothing")
- 16 <= len < 20: 16 pixels are written; the remaining 1 to 3 pixels are not set
- and so on, in 4-pixel steps...

While the "step" effect is normal (16- or 32-bit VRAM accesses), the "less than 16" effect is actually surprising. Furthermore, I had expected the step size to be 2 pixels, as I thought memcpy used 32-bit accesses for the main loop, then a 16-bit and an 8-bit access to fill the remainder. Here's a capture to show what I mean (taken from no$gba; same behaviour tested on a real DS):

[image not included in the archive]
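The pattern in the list above can be modelled in a few lines. This sketch assumes word-aligned pointers, the devkitARM rule quoted earlier in the thread (word copies only from 16 bytes up), and that byte writes to VRAM are dropped. The function names are invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Number of 8-bit pixels that would actually show up in VRAM after
 * memcpy(vram, data, len) under the assumptions stated above. */
size_t vram_visible_bytes(size_t len)
{
    if (len < 16) {
        return 0;               /* all byte writes, all dropped       */
    }
    return len & ~(size_t)3;    /* whole words land, tail bytes don't */
}

/* The cases from the list above. */
void check_observed_steps(void)
{
    assert(vram_visible_bytes(15) == 0);
    assert(vram_visible_bytes(16) == 16);
    assert(vram_visible_bytes(19) == 16);
    assert(vram_visible_bytes(20) == 20);
}
```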

Here's the test code used to show what I mean:

Code:

#include <nds.h>
#include <cstring>

int main(void) {
    powerON(POWER_ALL_2D);
    videoSetMode(MODE_5_2D | DISPLAY_BG3_ACTIVE);
    vramSetBankA(VRAM_A_MAIN_BG_0x06000000);

    // Palette: index 0 = white (background), index 1 = black (written pixels)
    BG_PALETTE[0] = RGB15(31,31,31);
    BG_PALETTE[1] = RGB15(0,0,0);

    // BG3 as a 256x256 8-bit bitmap at the start of BG VRAM
    BG3_CR = BG_BMP8_256x256 | BG_BMP_BASE(0) | BG_PRIORITY(2);

    // Identity affine transform, no scrolling
    BG3_XDX = 1 << 8;
    BG3_XDY = 0;
    BG3_YDX = 0;
    BG3_YDY = 1 << 8;
    BG3_CX = 0;
    BG3_CY = 0;

    uint16* screen = (uint16*) BG_BMP_RAM(0);
    memset(screen, 0, SCREEN_WIDTH*SCREEN_HEIGHT);

    uint8 data[256];
    memset(data, 1, 256);

    // Row i gets a memcpy of i bytes, so each row shows how many
    // pixels actually reach VRAM for that length
    for (int i = 0; i < 192; i++) {
        memcpy(screen, data, i);
        screen += 128; // 256px = 128 uint16
    }

    return 0;
}

Anyway, my advice would be: use memcpy only when you are sure everything is aligned (i.e. both the src and dst base addresses are multiples of 4, and the length is a multiple of 4).

I did some testing on loading individual tiles (64 bytes each).
Loading a full screen of tiles took:
856 microseconds for swiCopy (COPY_MODE_WORD and COPY_MODE_HWORD give similar results)
3174 microseconds for memcpy
1700 microseconds for dmaCopy (words/halfwords, async/sync)

I am quite curious about the wiki remark about memcpy being the fastest.
For now, I'm using swiCopy.

Mushu wrote:

Sorry to double post, but I'm a huge attention whore, and I wanted to get some attention :>

Anyway, I've never really been satisfied with the benchmarks done on memcpy/swiCopy/dmaCopy, simply because they all seem to show that memcpy is faster than dmaCopy. dmaCopy is a memory transfer with a dedicated controller, so why the hell would it be slower than a software memcpy?

Additionally, all of the benchmarks I've ever seen only gave a single number for each of the functions, rather than testing over a variety of transfer sizes and locations (main memory vs. VRAM vs. shared memory, etc). So I decided to hunker down and do some testing of my own, and the results were... informative.

I wrote a small app (source+binary to follow) to test all power-of-two block sizes from 16-1048576 bytes, using either main memory or VRAM as src/dst, and main memory for dst/src (I didn't want to end up with the situation where the source and destination are the same address, so only 1 of them is allowed to be VRAM).

Additionally, I mapped all the VRAM to 0x6000000, assuming that it will perform the same regardless of the mapping (which makes sense to me, but may not be the case).

Oh god I have to make a table. These values were obtained with the binary (to follow) on my DS Lite; there's probably at least 5 bugs in the source, so I wouldn't call them very conclusive, but they definitely do suggest conclusions -

Code:

Main Memory to Main Memory

            16B    32B    64B   128B   256B   512B  1024B  2048B
memcpy       41     52     73    115    312    731   1219   3906
dmaCopy      94    158    286    542   1054   2118   4148   8244
swiCopy     123    167    221    383    770   1440   2716   6691

          4096B  8192B 16384B 32768B 65536B
memcpy    11063  23697  48677  98291 196727
dmaCopy   16481  32887  65655 131263 262338
swiCopy   17789  38230  77787 155769 311512

etc, you get the picture. This matches what I've read so far, memcpy > dmaCopy > swiCopy, which is expected. When you switch over to VRAM, however, a different picture emerges -

Code:

Main Memory to VRAM

            16B    32B    64B   128B   256B   512B  1024B  2048B
dmaCopy      48     56     72    104    168    324    592   1145
memcpy       58     85    139    247    463    939   1891   4001
swiCopy     168    204    360    598   1118   2124   4212   8456

          4096B  8192B 16384B 32768B 65536B
dmaCopy    2226   4375   8627  17125  34121
memcpy     8711  18657  38657  77907 155903
swiCopy   17828  37827  76815 153746 307480

Anyway, so it appears that dmaCopy is significantly faster for transferring any size of data to VRAM, while memcpy is faster when working only in main memory. I would suspect that the dma hardware doesn't actually kick in unless you're transferring to VRAM, but I have no foundation for that conclusion.

I'm sure a lot of people already knew about this (or I've made a mistake somewhere ._.) but I always seem to have read that memcpy is faster than everything else etc, and just thought that was odd.

Blarg.

Sauce + Binary
Just the Binary

:<

Oh, and controls for the thing -
* A: Double the amount of memory allocated.
* B: Halve the amount of memory allocated.
* X: Toggle destination (VRAM/Main Memory)
* Y: Toggle source (VRAM/Main Memory)

this is from http://forum.gbadev.org/viewtopic.php?t=13242

source is there if you want to add more tests to it
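To put the quoted tables in relative terms, here is a small sketch that turns two timings into a speed ratio. The figures are the 65536-byte columns from the tables above; the tables give no unit, and none is needed for a ratio. The function names are made up for this example.

```c
#include <assert.h>
#include <stdio.h>

/* How many times faster the `fast` timing is than the `slow` one. */
double speedup(unsigned long slow, unsigned long fast)
{
    assert(fast > 0);
    return (double)slow / (double)fast;
}

/* Ratios for the 65536-byte column of each table. */
void print_ratios(void)
{
    printf("main->main: memcpy vs dmaCopy  %.2fx\n", speedup(262338UL, 196727UL));
    printf("main->VRAM: dmaCopy vs memcpy  %.2fx\n", speedup(155903UL, 34121UL));
    printf("main->VRAM: dmaCopy vs swiCopy %.2fx\n", speedup(307480UL, 34121UL));
}
```

By these numbers memcpy leads by roughly a third in main memory, while dmaCopy is between four and five times faster than memcpy when the destination is VRAM.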
_________________
http://www.myspace.com/knight0fdragonds

MK DS FC: Dragon 330772 075464
AC WW FC: Anthony SamsClub 1933-3433-9458
MPFH: Dragon 0215 4231 1206

Very informative. I too had to rely on speed gossips. Thanks for the link.

Nice benchmarks, however in my case swiCopy is the fastest solution!
Why is this?
The source data lives in a const u8 [100000] array holding a bitmap; which part of memory does that end up in? The target is VRAM.

EDIT: I just checked and the source is in 0x2........ which is main RAM, so dmaCopy should be the fastest. However, it is not :(

DOH!

My problem was the emulator!

Copying RAM->VRAM
Emulator NO$GBA 64 bytes
Swi (22)
dma (83)
memcpy (156)
Emulator NO$GBA 64kbytes
Swi (22) !!
dma (32819)
memcpy (120870)

Don't trust the emulator :) Lesson learned for me :)
On the hardware, dmaCopy indeed rules them all, with the async option being the fastest.
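One cheap sanity check that would have flagged the emulator figures above: a copy of 1024 times as many bytes should take far longer, yet the Swi timing stayed at 22. A sketch of such a check, where the function name and the threshold of 100 are arbitrary choices made for this example:

```c
#include <assert.h>

/* Nonzero when the timing grew by at least `min_factor` between the
 * small and the large transfer. */
int scales_with_size(unsigned long t_small, unsigned long t_large,
                     unsigned long min_factor)
{
    assert(t_small > 0);
    return t_large / t_small >= min_factor;
}
```

With the NO$GBA numbers above, `scales_with_size(83, 32819, 100)` and `scales_with_size(156, 120870, 100)` hold for dma and memcpy, while `scales_with_size(22, 22, 100)` fails for Swi.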

To make timing of SWI calls in NO$GBA somewhat more accurate, you can dump a BIOS from your DS and use that.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.

tepples: How would one go about doing that? I found a program that dumped it to the save RAM for GBA, but I haven't seen any solutions for NDS.

The only DS BIOS dumper I could find that's not on one of the Three Forbidden Sites is this one. SLOT-2 only, not GBAMP. If it came with source code, I would have ported it to libfat+DLDI like I did with the GBA BIOS dumper, but it doesn't.

maybe this is helpful?
http://nds.cmamod.com/2007/01/24/dsbf_dump-79-bios-firmware-dumper/

I was just reading the source in there to dump the ARM7 ROM - yikes! How did someone figure out that those two instructions were there?!
Some people are just far too good!

DS development > swiCopy problem?

#148839 - JanoSicek - Fri Jan 11, 2008 1:40 am

#148856 - simonjhall - Fri Jan 11, 2008 11:06 am

#148859 - nipil - Fri Jan 11, 2008 1:59 pm

#148860 - JanoSicek - Fri Jan 11, 2008 2:50 pm

#148874 - eKid - Fri Jan 11, 2008 5:03 pm

#148877 - JanoSicek - Fri Jan 11, 2008 5:10 pm

#148953 - simonjhall - Sat Jan 12, 2008 2:02 pm

#148964 - Cearn - Sat Jan 12, 2008 6:04 pm

#148977 - simonjhall - Sun Jan 13, 2008 12:12 am

#148996 - nipil - Sun Jan 13, 2008 11:20 am

#149111 - JanoSicek - Tue Jan 15, 2008 2:53 pm

#149114 - knight0fdragon - Tue Jan 15, 2008 4:19 pm

#149130 - nipil - Tue Jan 15, 2008 8:21 pm

#149139 - JanoSicek - Tue Jan 15, 2008 10:39 pm

#149141 - JanoSicek - Tue Jan 15, 2008 11:24 pm

#149154 - tepples - Wed Jan 16, 2008 1:29 am

#149158 - kusma - Wed Jan 16, 2008 2:07 am

#149161 - tepples - Wed Jan 16, 2008 2:33 am

#149172 - OSW - Wed Jan 16, 2008 9:43 am

#149174 - simonjhall - Wed Jan 16, 2008 10:24 am