gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback Machine copies.

DS development > gx_fifo and dma

#150123 - nce - Wed Jan 30, 2008 7:16 am

Hi,

I'm currently trying to find a good way to render the 3D scene in my project.
So I've started looking into the GXFIFO (it's about time; I've always worked in immediate mode until now).

Reading GBATEK and some of the threads here, I'm starting to get an idea of how to do it.
But I have several questions to be sure that I've understood everything correctly.


The main idea is to stall the CPU as little as possible.

The scene is defined as several blocks, each block having its own display list.
A small table stored in DTCM holds the position of each block (and some other things like the camera frustum).

It should be possible to:
1 - find a block to draw and launch a DMA copy to GXFIFO
2 - while the DMA is busy, compute the visibility of the other blocks and put the visible ones on a small stack
3 - as soon as the DMA is no longer busy, launch a DMA copy of the next block on the stack, and go back to point 2
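
The three steps above can be sketched as a polling loop. This is only a minimal sketch under stated assumptions: the names DmaPort, Block, draw_blocks and PENDING_CAP are hypothetical, and the DMA is hidden behind two callbacks so the scheduling logic can be tried on a PC with a fake. On real hardware the callbacks would poll and program a DMA channel in geometry-FIFO mode, the display lists would have to live in main RAM (DMA cannot read DTCM), and the data cache would need flushing before each transfer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PENDING_CAP 8  /* hypothetical size of the "small stack" */

typedef struct { const uint32_t *list; size_t words; int visible; } Block;

/* Hypothetical abstraction of one DMA channel. */
typedef struct {
    int  (*busy)(void *ctx);
    void (*start)(void *ctx, const uint32_t *list, size_t words);
    void *ctx;
} DmaPort;

/* Interleaves visibility tests with DMA submissions.
   Returns the number of blocks handed to the DMA. */
size_t draw_blocks(const DmaPort *dma, const Block *blocks, size_t count,
                   int (*is_visible)(const Block *))
{
    const Block *pending[PENDING_CAP];
    size_t top = 0, tested = 0, sent = 0;

    while (tested < count || top > 0) {
        if (top > 0 && !dma->busy(dma->ctx)) {
            /* Steps 1 and 3: DMA is idle, send the next visible block.
               A stack pops in LIFO order, so draw order is not preserved. */
            const Block *b = pending[--top];
            dma->start(dma->ctx, b->list, b->words);
            sent++;
        } else if (tested < count && top < PENDING_CAP) {
            /* Step 2: DMA busy (or nothing queued), test one more block. */
            const Block *b = &blocks[tested++];
            if (is_visible(b)) pending[top++] = b;
        }
        /* Otherwise the stack is full or nothing is left to test: poll again. */
    }
    while (dma->busy(dma->ctx)) { /* wait for the last transfer */ }
    return sent;
}

/* Host-side fake: a transfer stays "busy" for three polls. */
typedef struct { int polls_left; size_t started; size_t words_sent; } FakeDma;

static int fake_busy(void *c)
{
    FakeDma *d = (FakeDma *)c;
    if (d->polls_left > 0) { d->polls_left--; return 1; }
    return 0;
}

static void fake_start(void *c, const uint32_t *list, size_t words)
{
    FakeDma *d = (FakeDma *)c;
    (void)list;
    assert(d->polls_left == 0);  /* never start while still busy */
    d->polls_left = 3;
    d->started++;
    d->words_sent += words;
}

static int flag_visible(const Block *b) { return b->visible; }
```

In a real build, is_visible would be the frustum test against the DTCM table, and testing one block per loop iteration keeps the DMA poll frequent.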

If I've understood correctly, as long as all my data is in DTCM or in the cache, the CPU should be able to keep computing even during the DMA copy, right?
Do I have to put my code in ITCM too?


The next point is about the geometry FIFO itself. (I had to read that part many times to understand it :) )


The FIFO has 256 entries. The DMA will wait (without freezing the CPU) until the FIFO has 128 entries free, and then it will copy 112 words of data.
Those 112 words will be unpacked, and if the unpacked commands fill the FIFO, the DMA and the CPU will be frozen!

GBATEK uses this example: sending 112 x Packed(00151515h) to GXFIFO would write 336 x Cmd(15h) to the FIFO.
For me 0x00151515 is not a word but a dword, so this probably means that GBATEK uses "word" for a 32-bit value and not a 16-bit one like I'm used to, right?
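
The arithmetic behind that example can be checked with a few lines of C. This is a small sketch, not library code: gx_pack and gx_entries_for_parameterless are hypothetical helper names, and the byte order (first command in the low byte) is how I read GBATEK's packed format. A "word" here is 32 bits, the usual ARM convention.

```c
#include <assert.h>
#include <stdint.h>

/* Packs up to four command IDs into one 32-bit word, first command in
   the low byte. Unused slots are passed as 0. */
static inline uint32_t gx_pack(uint8_t c0, uint8_t c1, uint8_t c2, uint8_t c3)
{
    return (uint32_t)c0 | ((uint32_t)c1 << 8) |
           ((uint32_t)c2 << 16) | ((uint32_t)c3 << 24);
}

/* FIFO entries produced by `words` packed words that each hold
   `cmds_per_word` commands taking no parameters (one entry per command). */
static inline unsigned gx_entries_for_parameterless(unsigned words,
                                                    unsigned cmds_per_word)
{
    assert(cmds_per_word >= 1 && cmds_per_word <= 4);
    return words * cmds_per_word;
}
```

With three MTX_IDENTITY (15h) commands per word, gx_pack(0x15, 0x15, 0x15, 0) gives 0x00151515, and 112 such words expand to 112 * 3 = 336 entries, which is more than the 256-entry FIFO can hold.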

So it means that when you build the call list, you have to check that each run of 112 words doesn't unpack into more than 128 entries, and then you will never freeze your CPU at render time...
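
Such a build-time check could look roughly like the sketch below. Several things are assumptions here: the function names are hypothetical, the parameter counts are the ones I recall from GBATEK's command table and should be verified against it, zero command bytes are treated as padding that produces no entry, and the list is split into aligned chunks starting at word 0 (the real DMA bursts may not line up that way). Each parameter word counts as one FIFO entry and each parameterless command as one entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Parameter count per geometry command ID, or -1 if unknown.
   Values recalled from GBATEK; verify before relying on them. */
static int gx_param_count(uint8_t cmd)
{
    switch (cmd) {
    case 0x11: case 0x15: case 0x41:                       return 0;
    case 0x10: case 0x12: case 0x13: case 0x14:            return 1;
    case 0x16: case 0x18:                                  return 16;
    case 0x17: case 0x19:                                  return 12;
    case 0x1A:                                             return 9;
    case 0x1B: case 0x1C:                                  return 3;
    case 0x20: case 0x21: case 0x22:                       return 1;
    case 0x23:                                             return 2;
    case 0x24: case 0x25: case 0x26: case 0x27: case 0x28: return 1;
    case 0x29: case 0x2A: case 0x2B:                       return 1;
    case 0x30: case 0x31: case 0x32: case 0x33:            return 1;
    case 0x34:                                             return 32;
    case 0x40: case 0x50: case 0x60:                       return 1;
    case 0x70:                                             return 3;
    case 0x71:                                             return 2;
    case 0x72:                                             return 1;
    default:                                               return -1;
    }
}

/* Walks a packed command list and returns the largest number of FIFO
   entries produced by any aligned chunk of `chunk_words` input words,
   or -1 if an unknown command ID is found. */
long gx_max_entries_per_chunk(const uint32_t *list, size_t words,
                              size_t chunk_words)
{
    uint32_t pending = 0;   /* command bytes not yet decoded */
    int params_left = 0;    /* parameters still owed to the current command */
    long cur = 0, max = 0;

    assert(chunk_words > 0);
    for (size_t i = 0; i < words; i++) {
        if (i % chunk_words == 0) cur = 0;   /* new chunk starts here */

        if (params_left > 0) { params_left--; cur++; }  /* parameter word */
        else pending = list[i];                         /* command word */

        /* Decode every command that needs no further input words. */
        while (params_left == 0 && pending != 0) {
            uint8_t cmd = (uint8_t)(pending & 0xFF);
            pending >>= 8;
            if (cmd == 0) continue;          /* padding byte */
            int n = gx_param_count(cmd);
            if (n < 0) return -1;
            if (n == 0) cur++; else params_left = n;
        }
        if (cur > max) max = cur;
    }
    return max;
}
```

Calling gx_max_entries_per_chunk(list, words, 112) and comparing the result against 128 gives the check described above. For typical lists made mostly of commands with parameters, the result stays at or below the chunk size, since only parameterless commands expand.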



What do you think about this idea?



I was also thinking about doing more or less the same thing when rendering my characters.
Because the characters use skeletons, I need the geometry engine to do the matrix work, so I can't compute the skeleton at the same time as rendering the character.
But maybe I can:

1 - compute the skeleton matrices for each character
2 - set all the matrices in the stack for one character
3 - DMA the call list
4 - while the DMA is busy, compute the AI for the next frame
5 - when the DMA is free, go back to 2




One last thought.
I know that BOX_TEST exists. For each block it should be possible to send a box test and do something else while waiting for the result (as with the characters), and then do something else again during the rendering.
But what can I do? That's the question :) During the box test I could run my collision code, but during the rendering?
- not the sound, since it uses DMA too (I think; I've never played with that yet)
- I was thinking of maybe doing some sort of bloom, but I can't store the buffer in DTCM (too big)

Any ideas?


Which do you think will be faster, the first idea or the second?



thanks,
_________________
-jerome-

#150126 - simonjhall - Wed Jan 30, 2008 9:03 am

Just a quick reply as I have to run to work!
a) you're right, you want to freeze the cpu for as little time as possible. By accessing main memory whilst you use dma, you'll slow down both the dma and the regular load/stores. By putting your data in dtcm you'll avoid bus traffic and hopefully everything will go faster

b) by putting code in itcm, when you get icache misses it won't have to access the bus and slow down the dma. Remember to put your interrupt handlers in here too if you really want to avoid bus traffic.

c) remember that dma can't access the stack, dtcm or itcm

d) the idea of pre-calcing blocks and dmaing them off whilst visibility testing is good. That's what I do :-)
You could even build the display lists on the second processor. Don't forget about data caching.

e) the word limit and that packed->unpacked thing: don't worry. Remember that the thing mentioned about unpacking and running out and slowing down is for commands which have no arguments. I'm sure that the majority of your commands will have arguments, and they don't get packed/unpacked - cos they're just data.

f) afaik, you can't put a box test in a display list.

Time to go to work :(
_________________
Big thanks to everyone who donated for Quake2

#150191 - wintermute - Thu Jan 31, 2008 1:08 am

simonjhall wrote:
Just a quick reply as I have to run to work!
a) you're right, you want to freeze the cpu for as little time as possible. By accessing main memory whilst you use dma, you'll slow down both the dma and the regular load/stores. By putting your data in dtcm you'll avoid bus traffic and hopefully everything will go faster


Not entirely true - the CPU will be stalled if it accesses hardware registers which means interrupts will be delayed during DMA. Also remember that data in DTCM reduces the space available for your stack.

Quote:

b) by putting code in itcm, when you get icache misses it won't have to access the bus and slow down the dma. Remember to put your interrupt handlers in here too if you really want to avoid bus traffic.


See above. The CPU will be stalled in the dispatcher. The libnds dispatcher is already in itcm.
_________________
devkitPro - professional toolchains at amateur prices

#150213 - simonjhall - Thu Jan 31, 2008 9:00 am

wintermute wrote:
Not entirely true - the CPU will be stalled if it accesses hardware registers which means interrupts will be delayed during DMA. Also remember that data in DTCM reduces the space available for your stack.
Oh really! I'd never thought about that actually!

But yeah I doubt DMA is gonna be noticeably faster by doing absolutely zero work on the bus - someone wanna knock up an example to see?
One way to do it would be to turn off interrupts, stick the code (on both cpus) in a tight loop and get a bunch of DMA transfers all going from different addresses. If the time they take (combined) is noticeably longer than the time they would take if they were run sequentially then...you see where I'm going here!
_________________
Big thanks to everyone who donated for Quake2

#150427 - nce - Mon Feb 04, 2008 11:19 am

Hi,

and thanks for your replies :)

I've continued to think about this idea.
From what you said, it looks quite complex to get a DMA really working in parallel with the CPU.

So the new idea:
- I could maybe cut the environment into blocks of 256 entries.
- copy those 256 entries with a memcpy (some threads here seem to say that memcpy is faster than a DMA copy)
- then do some computation (no need for it to be in DTCM anymore) while waiting for the FIFO to be empty.
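
A rough sketch of this CPU-fed variant follows. The names FifoPort and feed_in_chunks are hypothetical, and the FIFO is hidden behind two callbacks so the loop can be tried on a PC. On hardware, write() would store to GXFIFO and is_empty() would test the FIFO-empty flag in GXSTAT. The sketch assumes each chunk has already been sized so that it unpacks into at most 256 entries; otherwise the CPU would stall on a full FIFO during the copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical abstraction of the geometry command FIFO. */
typedef struct {
    void (*write)(void *ctx, uint32_t word);
    int  (*is_empty)(void *ctx);
    void *ctx;
} FifoPort;

/* Writes the list in chunks of at most `chunk` words. After each chunk it
   calls work() repeatedly until the FIFO reports empty.
   Returns the number of work() calls made. */
size_t feed_in_chunks(const FifoPort *f, const uint32_t *list, size_t words,
                      size_t chunk, void (*work)(void *), void *work_ctx)
{
    size_t slices = 0;

    assert(chunk > 0);
    for (size_t i = 0; i < words; ) {
        size_t n = (words - i < chunk) ? words - i : chunk;
        for (size_t k = 0; k < n; k++) f->write(f->ctx, list[i + k]);
        i += n;
        /* Overlap: do small slices of other work while the FIFO drains. */
        while (!f->is_empty(f->ctx)) { work(work_ctx); slices++; }
    }
    return slices;
}

/* Host-side fake: one entry per written word, drained on every poll. */
typedef struct { size_t entries, written, drain_per_poll; } FakeFifo;

static void fake_write(void *c, uint32_t word)
{
    FakeFifo *f = (FakeFifo *)c;
    (void)word;
    f->entries++;
    f->written++;
}

static int fake_is_empty(void *c)
{
    FakeFifo *f = (FakeFifo *)c;
    size_t d = (f->drain_per_poll < f->entries) ? f->drain_per_poll : f->entries;
    f->entries -= d;
    return f->entries == 0;
}

static void count_work(void *c) { (*(size_t *)c)++; }
```

The work callback should be kept short, because the geometry engine sits idle from the moment the FIFO empties until the next chunk is written.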


Does this sound stupid?
_________________
-jerome-

#150480 - elhobbs - Tue Feb 05, 2008 2:38 pm

there is an existing thread comparing dma and memcpy in different contexts

http://forum.gbadev.org/viewtopic.php?t=13242

I think dma is faster from main ram to vram and memcpy is faster to and from main ram