gbadev.org forum archive

I've recently rejigged loads of my rendering code to use display lists, and I'm lovin' the speed increase.
However, reading the code in glCallList I'm just interested in a comment in there:

Code:

GL_STATIC_INL void glCallList(const u32* list) {
    u32 count = *list++;

    // flush the area that we are going to DMA
    DC_FlushRange(list, count*4);

    // don't start DMAing while anything else is being DMAed because FIFO DMA is touchy as hell
    // If anyone can explain this better that would be great. -- gabebear
    while((DMA_CR(0) & DMA_BUSY)||(DMA_CR(1) & DMA_BUSY)||(DMA_CR(2) & DMA_BUSY)||(DMA_CR(3) & DMA_BUSY));

    // send the packed list asynchronously via DMA to the FIFO
    DMA_SRC(0) = (uint32)list;
    DMA_DEST(0) = 0x4000400;
    DMA_CR(0) = DMA_FIFO | count;
    while(DMA_CR(0) & DMA_BUSY);
}

Although I don't actually use this function (I use something I wrote myself which does it asynchronously), I'm a little worried by the comment right in the middle. Has anyone else had trouble with DMA to the graphics FIFO at the same time as other DMAs?

I don't bother waiting for other DMAs whilst doing it and it works fine for me... Assuming there is some issue with doing other DMAs at the same time, what if ARM7 DMA is being used at the same time as the GFX DMA?

Unrelated: although the ARM7 can't talk to the 3D hardware, can ARM7 DMA transfer data to the GFX FIFO?
EDIT: no, it can't

Peace.
_________________
Big thanks to everyone who donated for Quake2

Last edited by simonjhall on Mon Dec 31, 2007 6:12 pm; edited 1 time in total

simonjhall wrote:

Code:

// flush the area that we are going to DMA
DC_FlushRange(list, count*4);

Not sure if this still applies, but a couple of months ago I figured out the following:

DC_FlushRange does not test whether the range to flush is larger than the cache. If you flush-range 100kb, it walks the whole 100kb, where a DC_FlushAll would be quite a bit faster.

Also, once the memory location where the displaylist is located is flushed, you don't need to flush it again, unless you modify it of course.

imo it makes little sense to flush-range in a render-call. In my program, during level loading, I load all displaylists in memory, then DC_FlushAll and don't need to flush them anymore. Works like a charm that way for me.

PS: I also don't wait for DMA and haven't noticed a problem so far.
_________________
Kind Regards,
Peter

If you try and flush 100kb but none of that is in the cache, it should be instant - right? So I'd imagine that there's gotta be an upper bound on the amount of time it takes to flush a range of data, and this can't be more than flushing the entire cache...
Having said that, I have noticed that flushing a range is often much slower than flushing the entire cache :-)

One final gotcha I found recently:

Code:

void *ptr = malloc(size);
invalidate_range(ptr, size);
dma(source, ptr, size);
flush_range(ptr, size);

When you flush or invalidate an area of memory, it's going to be rounded up to the nearest 32 bytes, since that's the size of a cache line. So this fragment of code can be dodgy if size isn't a multiple of 32 bytes as some mallocs store information in the memory around the allocated block. malloc/free won't be happy if this information is invalidated away :-)
...this took ages for me to figure out!

To summarise, kids: if you're gonna invalidate parts of your cache, make sure the size is a multiple of the size of a cache line!
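A minimal sketch of one way to dodge this, assuming a C11 aligned_alloc is available (on a DS toolchain memalign may be the nearer equivalent). The names dma_buffer_alloc and round_up_to_line are made up for illustration and are not part of libnds. The idea is that the buffer both starts on a 32-byte boundary and is padded out to a whole number of lines, since the start of the range gets rounded to a line boundary as well as the end.

```c
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 32u  /* ARM9 data cache line size on the DS */

/* Round a byte count up to a whole number of cache lines. */
static size_t round_up_to_line(size_t size) {
    return (size + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
}

/* Hypothetical helper: allocate a buffer that starts on a cache-line
   boundary and owns every line it touches, so invalidating it cannot
   throw away the allocator's bookkeeping or a neighbouring block.
   The padded size is reported so the caller can flush/invalidate
   exactly that many bytes. Release the buffer with free(). */
static void *dma_buffer_alloc(size_t size, size_t *padded_size) {
    size_t padded = round_up_to_line(size);
    if (padded_size)
        *padded_size = padded;
    return aligned_alloc(CACHE_LINE, padded);
}
```

With this, the fragment above would call dma_buffer_alloc(size, &padded) instead of malloc(size) and pass padded to the invalidate and flush calls.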
_________________
Big thanks to everyone who donated for Quake2

Funny note: I'm currently doing my own roll-out of glCallList to avoid waiting for the DMA in a safe way. I posted about it in the "Coding"-forum yesterday. ;)

My idea is to reserve one DMA channel for the gfx-fifo and keep a queue of DMA jobs to perform. New jobs can be started from the DMA IRQ. To make this work robustly, you need a queue which is either atomic by design or protected by locks. This is what I'm currently working on. Some other logic is needed, like not flushing until the queue is finished and so on.
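A rough sketch of what such a queue could look like, written so it compiles off-target. Everything here is hypothetical: stub_start_dma, stub_irq_disable and stub_irq_enable stand in for programming the reserved DMA channel and masking interrupts, and the job struct and slot count are made up. It is not the actual implementation being worked on, just one way to get the "push from main code, start the next job from the IRQ" shape.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_SLOTS 8u  /* power of two so indices wrap with a mask */

typedef struct { const uint32_t *src; uint32_t words; } DmaJob;

static DmaJob queue[QUEUE_SLOTS];
static volatile uint32_t head, tail;  /* head: next job to start, tail: next free slot */
static volatile bool dma_active;      /* true while a transfer is in flight */

/* Stand-ins for the hardware (assumptions, not libnds calls). */
static const uint32_t *last_started;
static void stub_start_dma(const DmaJob *job) { last_started = job->src; }
static void stub_irq_disable(void) {}
static void stub_irq_enable(void) {}

/* Called from main code. Starts the job at once if the channel is idle,
   otherwise queues it. Returns false if the queue is full. */
static bool dma_queue_push(const uint32_t *src, uint32_t words) {
    DmaJob job = { src, words };
    bool ok = false;
    stub_irq_disable();  /* keep the check-and-update atomic w.r.t. the IRQ */
    if (!dma_active) {
        dma_active = true;
        stub_start_dma(&job);
        ok = true;
    } else if (tail - head < QUEUE_SLOTS) {
        queue[tail & (QUEUE_SLOTS - 1)] = job;
        tail++;
        ok = true;
    }
    stub_irq_enable();
    return ok;
}

/* Called from the DMA-complete interrupt: start the next job, if any. */
static void dma_irq_handler(void) {
    if (head != tail) {
        stub_start_dma(&queue[head & (QUEUE_SLOTS - 1)]);
        head++;
    } else {
        dma_active = false;
    }
}

/* Main code can poll this before flushing or swapping buffers. */
static bool dma_queue_idle(void) { return !dma_active; }
```

Masking interrupts around the push is the "lock" in this version; a lock-free single-producer/single-consumer ring would be the other route.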

simonjhall wrote:

If you try and flush 100kb but none of that is in the cache, it should be instant - right? So I'd imagine that there's gotta be an upper bound on the amount of time it takes to flush a range of data, and this can't be more than flushing the entire cache...

Actually it would probably make more sense to flush the entire cache if the size is greater than the size of the cache. DC_FlushRange currently flushes a line at a time for the entire range given, so it's likely to be a tad slower.
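For illustration only, that size check could be sketched like this. The wrapper name flush_for_dma is made up, and the two stub functions merely count calls so the snippet builds off-target; on the DS they would be DC_FlushRange and DC_FlushAll. The 4 KB figure is the ARM9 data cache size, and whether the exact crossover point is the best threshold is an assumption that would need measuring.

```c
#include <stddef.h>
#include <stdint.h>

#define DCACHE_SIZE 4096u  /* ARM9 data cache on the DS is 4 KB */

/* Stand-ins for libnds' DC_FlushRange / DC_FlushAll (assumption: they
   only count calls here so the sketch runs on a PC). */
static int range_flushes, full_flushes;
static void stub_flush_range(const void *base, size_t size) {
    (void)base; (void)size;
    range_flushes++;
}
static void stub_flush_all(void) { full_flushes++; }

/* Hypothetical wrapper: a range at least as big as the cache cannot be
   cheaper to walk line by line than flushing everything, so fall back
   to the full flush in that case. */
static void flush_for_dma(const void *base, size_t size) {
    if (size >= DCACHE_SIZE)
        stub_flush_all();
    else
        stub_flush_range(base, size);
}
```

A 256-byte display list would still go through the range flush, while the 100kb case from earlier in the thread would take the full-flush path.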

I keep meaning to have a look at the ARM cache control again and see if there's a way to determine if a cache line is within a particular range. That way we could just check each line & flush/invalidate instead of doing it for an entire address range.

Quote:

One final gotcha I found recently:

Code:

void *ptr = malloc(size);
invalidate_range(ptr, size);
dma(source, ptr, size);
flush_range(ptr, size);

When you flush or invalidate an area of memory, it's going to be rounded up to the nearest 32 bytes, since that's the size of a cache line. So this fragment of code can be dodgy if size isn't a multiple of 32 bytes as some mallocs store information in the memory around the allocated block. malloc/free won't be happy if this information is invalidated away :-)
...this took ages for me to figure out!

I bet. That's not even something I've ever thought about, to be honest. Cache coherency is such a nightmare.
_________________
devkitPro - professional toolchains at amateur prices
devkitPro IRC support
Personal Blog

DS development > GFX FIFO DMAage

#147992 - simonjhall - Mon Dec 31, 2007 3:42 pm

#147995 - Peter - Mon Dec 31, 2007 4:08 pm

#148001 - simonjhall - Mon Dec 31, 2007 4:56 pm

#148004 - kusma - Mon Dec 31, 2007 5:07 pm

#148017 - wintermute - Mon Dec 31, 2007 7:31 pm