gbadev.org forum archive

How much data can be loaded into texture-allocated VRAM banks between frames? If I remember correctly, because of the 48-scanline cache, I could start loading at scanline 144 (is there some way to tell when the load to that cache is done?), but at scanline 214 the hardware starts processing the 3D data given to it.

So, in ~70 scanlines if I'm correct, how much data can be copied straight into VRAM?
_________________
DS - It's all about DiscoStew

Using all four DMA channels, I managed to max it out at a little over 32 KB.

That's enough time to get an entire 128x128 16-bit texture up, but not much else.
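The 32 KB figure matches the footprint of the texture mentioned above. A minimal sketch of that arithmetic in plain C follows; the helper names are made up for illustration, and the 70-scanline window is the rough estimate from the first post, not a measured value.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes occupied by an uncompressed texture (hypothetical helper). */
static uint32_t texture_bytes(uint32_t width, uint32_t height,
                              uint32_t bits_per_texel)
{
    assert(bits_per_texel % 8 == 0);   /* whole bytes per texel only */
    return width * height * (bits_per_texel / 8);
}

/* Average bytes that must move per scanline to fit a copy in the window. */
static uint32_t bytes_per_scanline(uint32_t total_bytes, uint32_t scanlines)
{
    assert(scanlines > 0);
    return total_bytes / scanlines;    /* integer division, rounds down */
}
```

texture_bytes(128, 128, 16) gives 32768 bytes, and spreading that over roughly 70 scanlines averages out to about 468 bytes per scanline.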

Thx, though that kinda cuts into what I had planned. I had thought that more could be done.

All 4 DMA channels? If I remember right, each channel co-opts the others, so there is no gain in using more than 1 channel.

Also, if you are doing any sort of heavy wireless connection stuff, this nearly renders the 3D system useless.

I managed to load a 256*256 16 bit texture in the VBlank without any flickering. Here's the code I used:

Code:

void on_irq()
{
    if (REG_IF & IRQ_VBLANK) {
        printf("\x1b[2J");
        printf("\nin the VBLANK");
        Texture.Load(16, TEXTURE_SIZE_256, GL_RGB, "bigface256.pcx");
        // Tell the DS we handled the VBLANK interrupt
        VBLANK_INTR_WAIT_FLAGS |= IRQ_VBLANK;
        // REG_IF is write-1-to-clear, so write only the VBlank bit;
        // "|=" would acknowledge every pending interrupt
        REG_IF = IRQ_VBLANK;
    } else {
        REG_IF = REG_IF; // Ignore all other interrupts
    }
}

void CTexture::Load(int index, int Size, int Format, const char *TextureFile)
{
    loadPCX((u8*)gbfs_get_obj(&data_gbfs, TextureFile, NULL), &pcx);
    image8to16trans(&pcx, 0);
    myglGenTextures(1, &Texture.texture[index]);
    LoadToVram(index, Format, Size, pcx.image.data8, TEXGEN_TEXCOORD | GL_TEXTURE_WRAP_S | GL_TEXTURE_WRAP_T);
    imageDestroy(&pcx);
}

int CTexture::LoadToVram(int index, int type, int Dimensions, uint8* texture, int param)
{
    uint32* addr;
    uint32 vramTemp;
    uint32 size = 1 << (Dimensions + Dimensions + 6);
    switch (type) {
    case GL_RGB:
    case GL_RGBA:
        size = size << 1;
        break;
    case GL_RGB4:
        size = size >> 2;
        break;
    case GL_RGB16:
        size = size >> 1;
        break;
    default:
        break;
    }

    addr = (uint32*)GetBestTextureSlot(size, index);
    if (!addr) { return 0; }

    Slots[ToTextureSlot[index]].Param = param;
    Slots[ToTextureSlot[index]].Dimensions = Dimensions;
    Slots[ToTextureSlot[index]].Mode = type;
    Slots[ToTextureSlot[index]].GlobalTextureIndex = TextureToWrite;
    // unlock texture memory
    vramTemp = vramSetMainBanks(VRAM_A_LCD, VRAM_B_LCD, VRAM_C_LCD, VRAM_D_LCD);
    if (type == GL_RGB) {
        // We do GL_RGB as GL_RGBA, but we set each alpha bit to 1 during the copy
        u16 * src = (u16*)texture;
        u16 * dest = (u16*)addr;
        glTexPar(Dimensions, Dimensions, addr, GL_RGBA, param);
        Slots[ToTextureSlot[index]].Mode = GL_RGBA;
        for (uint32 i = 0; i < ((size) >> 1); i++) {
            *dest = *src | (1 << 15);
            dest++;
            src++;
        }
    } else {
        // For everything else, we do a straight copy
        glTexPar(Dimensions, Dimensions, addr, type, param);
        swiCopy((uint32*)texture, addr, size / 4 | COPY_MODE_WORD);
    }
    vramRestoreMainBanks(vramTemp);
    return 1;
}
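For anyone following the size arithmetic in LoadToVram, here is a stand-alone sketch in portable C. It assumes the libnds convention that TEXTURE_SIZE_8 is 0 and each step doubles the edge length, so TEXTURE_SIZE_256 would be 5. The function names are hypothetical and not part of the posted class.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Texel count for a square texture whose edge is 8 << dim_code
   (assumed encoding: 0 -> 8, 5 -> 256). Same as 1 << (2 * dim_code + 6). */
static uint32_t texel_count(unsigned dim_code)
{
    assert(dim_code <= 7);             /* keeps the shift well in range */
    return 1u << (dim_code + dim_code + 6);
}

/* Copy 15-bit colour texels and force the alpha bit (bit 15) on,
   mirroring the GL_RGB branch of the posted loop. */
static void copy_with_alpha(uint16_t *dest, const uint16_t *src, size_t texels)
{
    for (size_t i = 0; i < texels; i++) {
        dest[i] = (uint16_t)(src[i] | (1u << 15));
    }
}
```

Under that assumption, texel_count(5) is 65536 texels, and at 2 bytes per texel the GL_RGB case comes to 131072 bytes (128 KB).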

zeruda,

If you were able to copy 128 KB into VRAM in one VBlank while needing to convert it from PCX and then manually copy it in (because of the GL_RGB format), do you think it would be possible to double that amount if the data needed no conversion or manual copying, but was just DMAed straight into VRAM?

I'm basically trying to reduce my overall use of vram banks from 4 to 2 so that I can have access to 2 banks for capture effects. This shouldn't be a problem for actual rendering if I can copy 2 vram banks worth in a vblank, as I can partition my 3D data into 2 layers based on depth, and just have my program run at 30fps instead of 60.

A little off topic, but it does involve my test with copying to VRAM.

I'm currently using DMA to copy from main memory to VRAM (because it's been tested to be the best for large amounts of data), and I've got a question. People have been saying that when they used the Asynch DMA functions and split the data between them, they got stuff copied faster (first 3 channels as asynch, and the last as a normal copy with the spin check). I tried it, but it takes just as long to copy the data as a single DMA function call that copies the entire section.

This...

Code:

dmaCopyWordsAsynch( 0, src, VRAM_A, 0x10000 );
dmaCopyWordsAsynch( 1, src + 0x8000 , VRAM_A + 0x8000, 0x10000 );
dmaCopyWordsAsynch( 2, src + 0x10000, VRAM_A + 0x10000, 0x10000 );
dmaCopyWords( 3, src + 0x18000, VRAM_A + 0x18000, 0x10000 );

...takes just as long to copy data as this does...

Code:

dmaCopyWords(3, src, VRAM_A, 0x40000 );

Any reason why this is happening?

EDIT:

Just to add, Miked0801: according to posts made in older threads on the DMA subject, using the functions above in that particular order resulted in faster data copying, though I'm not seeing it. Perhaps one of those people could shed some light on the subject?

DiscoStew wrote:

Code:

dmaCopyWordsAsynch( 0, src, VRAM_A, 0x10000 );
dmaCopyWordsAsynch( 1, src + 0x8000 , VRAM_A + 0x8000, 0x10000 );
dmaCopyWordsAsynch( 2, src + 0x10000, VRAM_A + 0x10000, 0x10000 );
dmaCopyWords( 3, src + 0x18000, VRAM_A + 0x18000, 0x10000 );

I'm not sure why it's slower, but are you attempting to copy 0x40000 bytes? It looks like your offsets are wrong:

0 = 0x00000 - 0x10000
1 = 0x08000 - 0x18000
2 = 0x10000 - 0x20000
3 = 0x18000 - 0x28000

To me, you're overlapping memory copies, which may be causing the slowdown you're experiencing.

But please bear in mind I do not know what I'm talking about, as I've never attempted to copy such large amounts of data on the DS yet.

SteveH wrote:

DiscoStew wrote:

Code:

dmaCopyWordsAsynch( 0, src, VRAM_A, 0x10000 );
dmaCopyWordsAsynch( 1, src + 0x8000 , VRAM_A + 0x8000, 0x10000 );
dmaCopyWordsAsynch( 2, src + 0x10000, VRAM_A + 0x10000, 0x10000 );
dmaCopyWords( 3, src + 0x18000, VRAM_A + 0x18000, 0x10000 );

I'm not sure why it's slower, but are you attempting to copy 0x40000 bytes? It looks like your offsets are wrong:

0 = 0x00000 - 0x10000
1 = 0x08000 - 0x18000
2 = 0x10000 - 0x20000
3 = 0x18000 - 0x28000

To me, you're overlapping memory copies, which may be causing the slowdown you're experiencing.

But please bear in mind I do not know what I'm talking about, as I've never attempted to copy such large amounts of data on the DS yet.

I should have mentioned that both "src" and "VRAM_A" are unsigned short pointers, so each offset is doubled when converted to bytes ((u16*)0x0100 + 0x010 = 0x0120).
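To make the pointer arithmetic concrete, here is a small sketch in plain C that works out the byte range each channel covers when the offsets are counted in u16 elements. The helper names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset reached by adding `elements` to a u16 pointer. */
static uint32_t u16_offset_bytes(uint32_t elements)
{
    return elements * (uint32_t)sizeof(uint16_t);
}

/* Byte range [start, end) covered by channel `ch` in the four-way split:
   each channel starts 0x8000 u16 elements after the previous one and
   copies 0x10000 bytes. */
static void channel_range(unsigned ch, uint32_t *start, uint32_t *end)
{
    assert(ch < 4);
    *start = u16_offset_bytes(ch * 0x8000u);
    *end   = *start + 0x10000u;
}
```

With element offsets of 0x8000, the channels cover 0x00000-0x10000, 0x10000-0x20000, 0x20000-0x30000 and 0x30000-0x40000, so the four copies sit back to back without overlapping and add up to the same 0x40000 bytes as the single call.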

EDIT:

I still haven't figured out whether it is possible to use all 4 channels at the same time, but I wonder: what about using some assembly with "ldmia" and "stmia"? Because the entire set of DMA registers (minus the FILL) comes out to about 36 bytes, I was thinking of filling out the first register with the necessary information, then making a 4-byte, 8-element array that is filled with the rest, and just doing the load and store. Could that do it, or is that method of copying done entry by entry, which would still activate the first channel before getting the others started? As it's been reported in GBATEK...

Quote:

The CPU can be kept running during DMA, provided that it is accessing only TCM (or cached memory), otherwise the CPU is halted until DMA finishes.


DiscoStew wrote:

I still haven't figured out if it is still possible to use all 4 channels at the same time

I posted my thoughts on this before: http://forum.gbadev.org/viewtopic.php?p=161609#161609
But I'm not too sure if it works (I never really needed to do it). I believe it's just a matter of filling all your params for all four DMA channels and then set the execution time for VBlank (or whatever, but not 'immediate' mode) and the whole she-bang should go.

However, I just do what was said earlier too: I loop calls to dmaCopyWordsAsynch() 'til there's no more data to send up. Works good enough for me (Shhh, don't tell Chris Hecker that).

The method a couple of posts above shows what I tried to do, which is what others have said worked, but doesn't for me.

The VBlank method doesn't seem to be working for me:

Code:

.
irqInit();
irqSet( IRQ_VBLANK, VBlank_Handler );
irqEnable( IRQ_VBLANK );
.

Code:

void VBlank_Handler()
{
    while( DMA_CR( 0 ) & DMA_BUSY );
    DMA_SRC( 0 ) = (uint32)tex;
    DMA_DEST( 0 ) = (uint32)vram;
    DMA_CR( 0 ) = 0x4000 | DMA_32_BIT | DMA_START_VBL | DMA_ENABLE;
    return;
}

Doing this method shows nothing on the screen. I even made print statements in the VBlank_Handler function, and those show up fine, but the texture doesn't. When I remove DMA_START_VBL, the texture shows. In that circumstance, I moved the while loop after the DMA functions, made a counter variable before them, and incremented the counter inside the while loop. The counter result was 0, as if it never entered the while loop.
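As a side note on what that control word asks for, here is a rough host-side sketch in plain C. The bit positions are as I recall them from libnds <nds/dma.h> and the GBATEK description of the ARM9 DMA control register, so treat them as assumptions and check them against your own headers. The SKETCH_ names and helper functions are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit layout (verify against <nds/dma.h> / GBATEK). */
#define SKETCH_DMA_ENABLE      (1u << 31)
#define SKETCH_DMA_START_MASK  (7u << 27)    /* start-timing field */
#define SKETCH_DMA_START_VBL   (1u << 27)
#define SKETCH_DMA_32_BIT      (1u << 26)
#define SKETCH_DMA_COUNT_MASK  0x001FFFFFu   /* unit count, bits 0-20 */

/* Combine a unit count with control flags, as the handler does. */
static uint32_t dma_make_cr(uint32_t units, uint32_t flags)
{
    assert(units <= SKETCH_DMA_COUNT_MASK);  /* count must fit its field */
    return units | flags;
}

/* Bytes requested by a control word: unit count times unit size. */
static uint32_t dma_request_bytes(uint32_t cr)
{
    uint32_t units = cr & SKETCH_DMA_COUNT_MASK;
    uint32_t unit_size = (cr & SKETCH_DMA_32_BIT) ? 4u : 2u;
    return units * unit_size;
}

/* Non-zero when the transfer is armed to start at VBlank, not immediately. */
static int dma_waits_for_vblank(uint32_t cr)
{
    return (cr & SKETCH_DMA_START_MASK) == SKETCH_DMA_START_VBL;
}
```

Under those assumptions, 0x4000 units at 32 bits each is a 0x10000-byte (64 KB) request, and with the VBlank start flag set the transfer is armed rather than started on the spot.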

Your DMA_START_VBL is set in the actual vblank handler so it won't go til next round, no?
Anyway, I just re-read a little on gbatek and I think this "all 4 channels at once" thing is not going to work no matter what:

Quote:

the highest priority is assigned to DMA0, followed by DMA1, DMA2, and DMA3. DMA Channels with lower priority are paused until channels with higher priority have completed

I'm out of town in Winnipeg and can't actually code and test anything that I'm babbling about though :(

Yep - only 1 DMA fires off at a time guys. There is no gain from using multiple interrupts (not channels) to do your bidding. 0 will go, then 1, then...
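A toy model in plain C shows why splitting the job buys nothing when channels run strictly in priority order. It is only a sketch: it assumes one word moves per tick and ignores setup cost and real bus timing, and the names are invented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DMA_CHANNELS 4

/* Toy model: one bus, one word moved per tick, and only the
   highest-priority (lowest-numbered) channel with work left may run.
   Returns the ticks needed to drain every channel; zeroes the array. */
static uint32_t ticks_to_drain(uint32_t words_left[DMA_CHANNELS])
{
    assert(words_left != NULL);
    uint32_t ticks = 0;
    for (;;) {
        int ch = -1;
        for (int i = 0; i < DMA_CHANNELS; i++) {
            if (words_left[i] > 0) { ch = i; break; }
        }
        if (ch < 0) {
            return ticks;              /* nothing left on any channel */
        }
        words_left[ch]--;              /* this channel owns the bus this tick */
        ticks++;
    }
}
```

In this model, 0x10000 words on channel 3 alone and 0x4000 words on each of the four channels both take 0x10000 ticks, which is consistent with the timings reported earlier in the thread.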

ldmia/stmia can beat DMAs on shorter copies as well due to register setup time.
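For reference, ldmia/stmia move several registers per instruction. Below is a portable C sketch of the same idea, copying eight words per iteration. Whether a given compiler actually emits ldmia/stmia for it is not guaranteed, and hand-written assembly would be needed to be sure. The function name is made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy `words` 32-bit words in blocks of eight. */
static void copy_words_x8(uint32_t *dest, const uint32_t *src, size_t words)
{
    assert(words % 8 == 0);            /* sketch handles whole blocks only */
    for (size_t i = 0; i < words; i += 8) {
        /* Load eight words into locals (ideally registers)... */
        uint32_t r0 = src[i + 0], r1 = src[i + 1];
        uint32_t r2 = src[i + 2], r3 = src[i + 3];
        uint32_t r4 = src[i + 4], r5 = src[i + 5];
        uint32_t r6 = src[i + 6], r7 = src[i + 7];
        /* ...then store all eight, as an ldmia/stmia pair would. */
        dest[i + 0] = r0; dest[i + 1] = r1;
        dest[i + 2] = r2; dest[i + 3] = r3;
        dest[i + 4] = r4; dest[i + 5] = r5;
        dest[i + 6] = r6; dest[i + 7] = r7;
    }
}
```

There are no DMA registers to program first, which is where the advantage on short copies would come from.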

ritz wrote:

Your DMA_START_VBL is set in the actual vblank handler so it won't go til next round, no?

I'm not getting anything when I use DMA_START_VBL, but do you think that because it's being set while in the "interrupt" state, it might be getting dismissed? I'll try it outside of the handler to see if anything changes.

ritz wrote:

Anyway, I just re-read a little on gbatek and I think this "all 4 channels at once" thing is not going to work no matter what:

Quote:

the highest priority is assigned to DMA0, followed by DMA1, DMA2, and DMA3. DMA Channels with lower priority are paused until channels with higher priority have completed

I'm out of town in Winnipeg and can't actually code and test anything that I'm babbling about though :(

I do remember that, but I thought that was mainly for the use on the GBA. If it holds true for the DS, then perhaps some reasons to still have 4 channels is for GBA compatibility, as well as for other uses like what I'm trying to do, except in the sense of having up to 4 different sections of data being copied to 4 different areas, and it goes through each one after the other.

EDIT:

Just gave the altered method a try (moving the DMA functionality outside the handler), and now it's copying, and the texture is showing up. I had to add a VCount interrupt alongside the VBlank interrupt so it would change the state of the VRAM banks to LCDC between finishing the last buffered scanline and the beginning of the VBlank. I tried it with all 4 channels afterwards. Unfortunately, this method shows that the DMA copying is done one channel at a time.

DiscoStew wrote:

If you were able to copy 128kBytes into VRAM in one VBlank with needing to convert it from pcx and then manually copying it in (because of the GL_RGB format), you think it would be possible of doubling that amount if the data had no conversions or manual copying, but was just DMAed straight into VRAM?

I've got no idea; I didn't bother with timings. Best just to experiment. I had it set up so that when I pressed a button, 1 or more textures would load. When there was too much, there would be a flash; otherwise, a straight transition. This was ages ago though. How many textures are you using? Are you using compressed textures?

In terms of DMA, when you fire a DMA it runs alongside the CPU. You can then simultaneously use the CPU and process data in the TCM. As soon as you touch main RAM, though, the CPU stalls until the DMA completes. More specifically, if you access any variables, code or functions in main RAM, it stalls. So looking at your code:
The variable src: where is it? If it's on the stack/DTCM then it's OK. I'm guessing this is fine.

The functions dmaCopyWordsAsynch and dmaCopyWords: where are these? If they are in main RAM, then the first call runs, and the CPU stalls as soon as the second call to dmaCopyWordsAsynch occurs. To get around this, these functions would have to be placed in the ITCM.


DS development > Texture load amount in between frame renders?

#170240 - DiscoStew - Fri Sep 11, 2009 1:42 am

#170242 - TwentySeven - Fri Sep 11, 2009 4:23 am

#170243 - DiscoStew - Fri Sep 11, 2009 4:50 am

#170247 - Miked0801 - Fri Sep 11, 2009 2:43 pm

#170248 - zeruda - Fri Sep 11, 2009 3:12 pm

#170251 - DiscoStew - Fri Sep 11, 2009 4:38 pm

#170255 - DiscoStew - Sun Sep 13, 2009 2:47 am

#170257 - SteveH - Sun Sep 13, 2009 4:37 pm

#170259 - DiscoStew - Sun Sep 13, 2009 5:47 pm

#170263 - ritz - Sun Sep 13, 2009 10:44 pm

#170264 - DiscoStew - Mon Sep 14, 2009 3:56 am

#170265 - ritz - Mon Sep 14, 2009 4:25 am

#170267 - Miked0801 - Mon Sep 14, 2009 5:27 am

#170268 - DiscoStew - Mon Sep 14, 2009 5:27 am

#170269 - zeruda - Mon Sep 14, 2009 7:53 am