#176065 - LOst? - Tue Mar 29, 2011 8:39 pm
One of the games I have made, which I can't post because I don't own the rights to it, works really well in DS emulators.
Tiles use one of 9 extended palettes to fade to a darker tone of the original color, though not the exact color you might expect from a plain fade. A tile has the lighter color before it gets locked on the playfield, and fades when it is about to lock by animating the extended palette index.
The problem is on real hardware. At random times, sometimes very early and sometimes very late in the gameplay, all the tiles go super dark (as if using the darkest of the extended palettes) and stay that way for a random amount of time. I have tried this on the original DS, DS Lite, DSi, and DSi XL, all with the same annoying bug.
In my gameplay, I have a hidden mode which darkens all the tiles for a short time (almost exactly as the bug, only darker I think), to make the game harder, but that mode has never been finished, and the condition to get into that mode is hardcoded to never happen. If the condition is buggy, then it should have been triggered on both emulators and real hardware, so it cannot be that, can it?
This is so difficult. The game has been delayed for 3 years because of this bug, and I have basically left it alone, hoping that the bug will show itself in a newer emulator. Not even NO$GBA's DS emulation can trigger the bug. Not even Nintendo's... yeah, when I say emulators, I mean all of them, legal or not.
So what can it be? An overflow in the timer that causes the fade? But why is it so random? Is this a cache thing again? I have never understood the data and instruction caches on the DS fully.
It is so sad when a game is so damn good and all the time spent making it goes to waste because of an impossible bug!
The game engine was programmed to be portable, and I have the game on other platforms, and there is no bug.
Well, if you can't help me, at least tell me when you need to cache data? You know, if you must cache 4 bytes of data, or only arrays, or a specific array size, and if DMA caches data automatically. I kinda lost faith in DS emulation when I saw NO$GBA doing all my sprites right, but on the real hardware, the sprites were delayed by 1 frame. It is so difficult to know when the right time is to copy your OAM list to the actual OAM, and when to cache the data!
_________________
Exceptions are fun
#176066 - keldon - Tue Mar 29, 2011 9:28 pm
DMA doesn't guarantee the cache is flushed; that's your responsibility. The fact that the bug does not occur on the emulators may be a hint as to some of the factors involved.
IIRC, data cache [flushing/]invalidation has some sort of alignment requirement, and I'm pretty sure DMA writes don't invalidate the cache either (that is also your responsibility).
Regarding copying OAM, my policy is simply to keep my own 'shadow' OAM data; any update to my OAM results in that data being transferred during VBlank. I do the same thing with screenblock entries and sprite/bg tiles as well, i.e. I always defer writes to VRAM.
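A minimal sketch of that shadow-OAM policy in plain C, so it can be tried off-target. The names `OamEntry`, `shadow_set_sprite` and `shadow_commit` are made up for illustration and are not libnds API; `hw_oam` is an ordinary array standing in for real OAM.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SPRITE_COUNT 128

/* One OAM entry: four 16-bit attributes (8 bytes). */
typedef struct { uint16_t attr[4]; } OamEntry;

static OamEntry shadow_oam[SPRITE_COUNT]; /* edited freely by game logic */
static int      shadow_dirty;             /* set whenever the shadow changes */

/* Game code may call this at any point in the frame; nothing touches
 * the hardware copy here. */
void shadow_set_sprite(int index, uint16_t a0, uint16_t a1, uint16_t a2)
{
    assert(index >= 0 && index < SPRITE_COUNT);
    shadow_oam[index].attr[0] = a0;
    shadow_oam[index].attr[1] = a1;
    shadow_oam[index].attr[2] = a2;
    shadow_dirty = 1;
}

/* Meant to run once per frame during VBlank. Copies the shadow to the
 * hardware copy only if something changed. Returns 1 if a copy happened. */
int shadow_commit(OamEntry *hw_oam)
{
    assert(hw_oam != NULL);
    if (!shadow_dirty)
        return 0;
    memcpy(hw_oam, shadow_oam, sizeof shadow_oam);
    shadow_dirty = 0;
    return 1;
}
```

On hardware the commit would run in the VBlank handler and target OAM itself. Real OAM does not take 8-bit writes as far as I know, so `memcpy` would likely need to be swapped for a 16- or 32-bit copy there.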
So how about wrapping your DMA calls to flush the source beforehand and destination afterwards for a start? Edit: actually that's wrong, should be destination beforehand as well [link].
---
Oh, and random times ... what would happen if you recorded your pad input to a file and set the game to run that sequence from boot? At least that way you could home in on exactly when it happens (with some trial and error of course). And if the 'gameplay-mode' code is being called, you can also make use of a log of exactly what frame this occurs on ...
... well those are the areas I'd start.
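The record-and-replay idea could look roughly like the sketch below. Everything here (`InputLog`, `log_record`, `log_replay`, the 4096-frame limit) is hypothetical; on the DS the per-frame value would presumably come from libnds `keysHeld()` after `scanKeys()`, and the random seed would have to be fixed as well for the run to repeat.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_FRAMES 4096

typedef struct {
    uint16_t keys[MAX_FRAMES]; /* one key bitmask per frame */
    uint32_t count;            /* frames recorded so far */
    uint32_t cursor;           /* next frame to replay */
} InputLog;

void log_reset(InputLog *log)
{
    assert(log != NULL);
    log->count = 0;
    log->cursor = 0;
}

/* Store the key state of the current frame. Returns 0 when the log is full. */
int log_record(InputLog *log, uint16_t keys)
{
    assert(log != NULL);
    if (log->count >= MAX_FRAMES)
        return 0;
    log->keys[log->count++] = keys;
    return 1;
}

/* Fetch the keys for the next replayed frame and report its frame number,
 * which is what you would print when the dark-tile state first appears.
 * Returns 0 once the recording is exhausted. */
int log_replay(InputLog *log, uint16_t *keys_out, uint32_t *frame_out)
{
    assert(log != NULL && keys_out != NULL && frame_out != NULL);
    if (log->cursor >= log->count)
        return 0;
    *frame_out = log->cursor;
    *keys_out = log->keys[log->cursor++];
    return 1;
}
```

Feeding the game from `log_replay` instead of the pad gives every boot the same frame numbers, so a log line from the hidden darkening mode can be compared across runs.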
Last edited by keldon on Wed Mar 30, 2011 6:32 am; edited 1 time in total
#176067 - ant512 - Tue Mar 29, 2011 10:41 pm
DMA bypasses the cache entirely. You can encounter DMA problems in two cache-related ways:
- The DMA can copy old data from memory because the cache hasn't been written out to main RAM yet;
- The memory written to by the DMA can be ignored because the cache is considered to still be up to date after the DMA write has finished.
Here's an example of the first problem:
- CPU is told to write 32 zeros to memory location x, overwriting 32 ones;
- CPU actually writes 32 zeros to cache;
- DMA is told to copy 32 bytes from memory location x to y;
- DMA copies 32 ones from x to y.
What the DMA will copy is 32 ones, not 32 zeros, because the cache hasn't been flushed. This is what should happen:
- CPU is told to write 32 zeros to memory location x, overwriting 32 ones;
- CPU actually writes 32 zeros to cache;
- CPU is told to flush cache;
- Cache writes 32 zeros to memory location x;
- DMA is told to copy 32 bytes from memory location x to y;
- DMA copies 32 zeros from x to y.
To ensure that what the DMA reads is correct, you must issue a cache flush instruction for a memory region before you start the DMA.
The second problem can be illustrated like this:
- CPU is told to read an int from memory location x;
- CPU reads value 16;
- Cache stores value 16 for location x;
- DMA is told to write 4 to memory location x;
- DMA writes value 4 to memory location x;
- CPU is told to read an int from x again;
- CPU reads value 16 from the cache.
This is what should happen:
- CPU is told to read an int from memory location x;
- CPU reads value 16;
- Cache stores value 16 for location x;
- DMA is told to write 4 to memory location x;
- DMA writes value 4 to memory location x;
- Cache for memory location x is invalidated;
- CPU is told to read an int from x again;
- CPU reads value 4 from memory location x.
Always make sure that you invalidate the cache for any memory regions you write to with the DMA.
Basically, follow this process whenever you use the DMA:
- Flush cache for the source memory region;
- Invoke DMA;
- Invalidate cache for the destination memory region.
Checking if the DMA is actually the problem in this case is pretty simple - replace any calls to the DMA with memcpy() or even for/next loops to copy bytes at a time. If the problem disappears you know it's a DMA issue. If it doesn't, you can disregard the DMA theory.
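Both failure modes can be reproduced off-target with a toy model. The sketch below is not DS code: it simulates 64 bytes of "main RAM", a single 32-byte write-back cache line, and a DMA that only ever touches RAM. All function names are invented for the illustration, and the real ARM9 cache differs in many details.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RAM_SIZE  64
#define LINE_SIZE 32 /* the ARM9 data cache also uses 32-byte lines */

static uint8_t ram[RAM_SIZE];        /* main RAM, as the DMA sees it */
static uint8_t line_data[LINE_SIZE]; /* the one and only cache line */
static int     line_base = -1;       /* RAM offset it mirrors, -1 = empty */
static int     line_dirty;           /* line holds data not yet in RAM */

/* Flush: write the line back if it is dirty, then drop it. */
void cache_flush(void)
{
    if (line_base >= 0 && line_dirty)
        memcpy(&ram[line_base], line_data, LINE_SIZE);
    line_base = -1;
    line_dirty = 0;
}

/* Invalidate: drop the line without writing it back. */
void cache_invalidate(void)
{
    line_base = -1;
    line_dirty = 0;
}

/* Make sure the line covering addr is loaded (evicting the old one). */
static void cache_load(int addr)
{
    int base = addr - addr % LINE_SIZE;
    if (line_base != base) {
        cache_flush();
        memcpy(line_data, &ram[base], LINE_SIZE);
        line_base = base;
    }
}

/* CPU accesses always go through the cache line. */
uint8_t cpu_read(int addr)
{
    assert(addr >= 0 && addr < RAM_SIZE);
    cache_load(addr);
    return line_data[addr - line_base];
}

void cpu_write(int addr, uint8_t value)
{
    assert(addr >= 0 && addr < RAM_SIZE);
    cache_load(addr);
    line_data[addr - line_base] = value;
    line_dirty = 1;
}

/* DMA bypasses the cache entirely: RAM to RAM. */
void dma_copy(int src, int dst, int len)
{
    assert(src >= 0 && dst >= 0 && len >= 0);
    assert(src + len <= RAM_SIZE && dst + len <= RAM_SIZE);
    memmove(&ram[dst], &ram[src], (size_t)len);
}

/* Test helper: set RAM contents directly, behind the cache's back. */
void ram_fill(int addr, uint8_t value, int len)
{
    assert(addr >= 0 && len >= 0 && addr + len <= RAM_SIZE);
    memset(&ram[addr], value, (size_t)len);
}
```

With libnds the same order would presumably be `DC_FlushRange()` on the source, the DMA itself, then `DC_InvalidateRange()` on the destination. Invalidation works on whole 32-byte lines, which is probably the alignment keldon mentions: a destination that shares a line with unrelated data can lose that data's pending writes.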
#176070 - coreyh2 - Wed Mar 30, 2011 5:25 am
#176072 - LOst? - Wed Mar 30, 2011 10:34 am
Very good information about the cache! I have learned a lot! Thanks ant512 for such a detailed explanation, and having coreyh2's examples helped a lot also!
keldon, thanks for the information as well! The interesting part about OAM is that I updated it just like you said, in the VBlank, but something wasn't right on the real hardware! You see, I am updating my backgrounds on each scanline using HDMA, which is initialized in that same VBlank. The scanline buffer is a double buffer that page-flips so that I can write new scanlines while the other ones are in use. This causes the sprites to be right, but the background to be one frame behind on the real hardware. I think the problem is yet again that I haven't understood the cache, and was guessing when I programmed that.
About the bug, if it is cache related, it is possibly a DMA operation. But what if I store words or double words using a for-loop, or a memcpy?
Will the cache be updated automatically when the CPU takes care of reads and writes? (I hope so). Is that true even if I write 32 bit values, or even 64 bit values? 64 bit values are good for high precision calculations, but I haven't used that yet (but I will in the future).
The bug is very random in its appearance, and the game acts on randomizers as well, but I have a recorded demo playback which I can get running by pressing a key sequence at the title screen, which I just have to change so it starts immediately (causing all randomizers to use the same values). I will then see if the bug happens at the same time, every time I boot.
#176074 - sverx - Wed Mar 30, 2011 1:34 pm
Normally VRAM banks, palettes and OAMs are uncached on dkA/libnds, unless you changed it. I mean: if it looks like something is changing the content of your tiles (specifying a different ext. palette number) OR it looks like something is changing the palette, this probably has nothing to do with caches anyway. Unless you've got a copy of your tiles (or the palettes) in your main memory, you change them, and you DMA them to VRAM/ext. palette. If so, then simply flushing the caches before launching the DMA copy should be enough. Or use memcpy(), which of course is a CPU-driven process and goes through the caches.
(I had a somewhat similar problem some time ago, but in my case I was changing the contents of RAM with both the ARM9 and the ARM7. Of course the ARM9 wasn't aware of the ARM7's changes unless I explicitly invalidated its data cache...)
#176078 - keldon - Thu Mar 31, 2011 1:40 pm
LOst? wrote: |
About the bug, if it is cache related, it is possibly a DMA operation. But what if I store words or double words using a for-loop, or a memcpy?
Will the cache be updated automatically when the CPU takes care of reads and writes? (I hope so). |
If VRAM is uncached then you will be fine; if it were cached then I would invalidate the destination (or at least test to confirm the behaviour).
LOst? wrote: |
I will then see if the bug happens at the same time, every time I boot. |
Well my thinking there was in isolating the incidents so that you can track the conditions (if changing the transfer methods does not remove the bug).
#176121 - Miked0801 - Tue Apr 12, 2011 6:02 pm
Anytime we've done any sort of DMA on DS, it needs a cache invalidation (which hurts DMA performance, but oh well.) If this is indeed a caching issue though, I'd expect to see 1 or 2 frame glitches, not continuous bad info for a while and then back to normal, but with DMA you never know.
Honestly though, this sounds more like a memory corruption error, like a write off the end of an array. If you have a debugger that allows it, put a write-to-memory breakpoint on your data to see if something is coming along and changing it when it's not supposed to.
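Without a debugger that can break on writes, a cheap substitute is to fence the suspect data with guard words and check them every frame. This is only a sketch under assumptions: `GuardedPalette`, `palette_index` and the magic value are made up, and the real game data would have to be moved into such a struct for the check to mean anything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GUARD_MAGIC   0xDEADBEEFu
#define PALETTE_SLOTS 16

/* The data under suspicion, with a guard word on each side. Member
 * order inside a struct is fixed, so the guards stay adjacent. */
typedef struct {
    uint32_t guard_front;
    uint8_t  palette_index[PALETTE_SLOTS]; /* hypothetical fade state */
    uint32_t guard_back;
} GuardedPalette;

void guarded_init(GuardedPalette *g)
{
    assert(g != NULL);
    g->guard_front = GUARD_MAGIC;
    memset(g->palette_index, 0, sizeof g->palette_index);
    g->guard_back = GUARD_MAGIC;
}

/* Returns 1 while both guards still hold the magic value. */
int guarded_intact(const GuardedPalette *g)
{
    assert(g != NULL);
    return g->guard_front == GUARD_MAGIC && g->guard_back == GUARD_MAGIC;
}
```

Calling `guarded_intact` once per frame and logging the first frame on which it fails narrows down when the stray write happens, though it only catches writes that run across the edges of the array.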
#176177 - LOst? - Tue May 03, 2011 10:27 am
Okay, so all this cache talk has really put a big question on my mind:
How can we properly use HDMA (HBlank DMA) when the DMA fires automatically at a time we don't control, and the cache might have changed by then?
I am a big fan of HDMA, because I can then keep a background scanline scroll list for each frame (page flipped), and have it transferred to the next visible frame. This worked really well with the GBA. It can be used for mode 7 and scanline scrolling.
I am looking for the best Nintendo DS solution to the HDMA situation. You really don't want to use a Hblank interrupt, or the Vcounter interrupt, because they can be used for other tasks. You want HDMA because it was designed to aid mode 7 games. Or is it just a broken leftover from the GBA?
Edit for more information:
Cydrak wrote: |
Sorry, couple things I should have mentioned. :-)
DMA doesn't know anything about TCM or cache. This becomes a problem as the stack is in DTCM, so transfer from a local array won't work. It needs to be somewhere in RAM or VRAM (such as by making it global or static). Further, if in RAM, you may need to flush changes out of cache.
Also, to make the new DMA_SRC / DMA_DEST settings "stick", DMA_ENABLE needs to be zero beforehand; it only responds when changed:
Code: |
DC_FlushRange(gradient, sizeof(gradient));
...
while( awesomeness )
{
swiWaitForVBlank();
...
DMA_CR(0) = 0;
DMA_SRC(0) = (u32)&gradient[1];
DMA_DEST(0) = (u32)&BG_PALETTE[0];
DMA_CR(0) = DMA_ENABLE | DMA_REPEAT | DMA_START_HBL | DMA_DST_RESET | 1;
...
} |
|
Cydrak's example code only does a gradient fill in the while loop, so a single flush before the loop is possible. But in my case, I will be doing DMA for OAM and changing the inactive scanline buffer page, which may disturb the cached data the HDMA depends on at the same time as, before, or after a scanline update. Since HDMA starts on its own, you can't time such an event (that's the power of HDMA, that you don't need to time anything yourself). The HDMA might be off at scanline 3 by the time another DMA is started by the game, or as I said, I might be writing or reading the inactive scanline buffer, or any other buffer whose cache activity might disturb what the HDMA reads on the next scanline.
But then Cydrak talks about another source: VRAM. Is it possible to allocate a R/W buffer big enough for 4 X and Y BG offset registers TIMES 2 (for active/inactive page flipping) TIMES 192 (num scanlines) in VRAM, and have that buffer as the source of the HDMA, without having to do anything with the cache? If the size is a problem, num of backgrounds can be decreased, and anything 3D will not be used in hope that there will be VRAM space left.
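The size being asked about is easy to pin down. A small sketch, assuming each BG offset register is 16 bits wide; the function name is invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

enum {
    BG_COUNT    = 4,  /* backgrounds per 2D engine */
    REGS_PER_BG = 2,  /* X offset and Y offset */
    PAGES       = 2,  /* active + inactive page */
    SCANLINES   = 192 /* visible lines on a DS screen */
};

/* Bytes needed for a page-flipped per-scanline scroll table covering
 * bg_count backgrounds, with one 16-bit value per offset register. */
uint32_t scroll_table_bytes(uint32_t bg_count)
{
    assert(bg_count >= 1 && bg_count <= BG_COUNT);
    return (uint32_t)(bg_count * REGS_PER_BG * sizeof(uint16_t)
                      * PAGES * SCANLINES);
}
```

For all four backgrounds that is 4 x 2 x 2 x 2 x 192 = 6144 bytes (6 KB), well under the 16 KB of the smallest VRAM banks, so size alone should not be the obstacle.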
#176180 - sverx - Wed May 04, 2011 8:28 am
Since I believe HDMA will read from an array of values stored somewhere you could either:
- tell the compiler to store that array in an un-cached location
- tell the processor not to cache the memory location where that array is
- always access (write to) the array through an uncached mirror address (my choice would be this one) so that the cache won't ever get in between
#176181 - LOst? - Wed May 04, 2011 11:42 am
sverx wrote: |
Since I believe HDMA will read from an array of values stored somewhere you could either:
- tell the compiler to store that array in an un-cached location
- tell the processor not to cache the memory location where that array is
- always access (write to) the array through an uncached mirror address (my choice would be this one) so that the cache won't ever get in between |
But if I write to the uncached mirror, mustn't the HDMA read from the uncached mirror too?
And how do I tell the compiler to put the array in an uncached location?
And how can I tell the processor to not cache the memory location where the array is?
And a different question: Are uncached reads/writes slower than normal reads/writes performed on the GBA ARM7 or the NDS ARM7 (those that don't have a cache to begin with)?
#176182 - elhobbs - Wed May 04, 2011 1:24 pm
dma always bypasses the cache. the easiest way is to use: Code: |
void *memUncached(void *address);
|
it converts a main memory pointer to its uncached mirror address.
so declare your array as you normally would - either as a global or use malloc/new (so it will be in main memory rather than the stack) then pass that pointer through memUncached and use that result for all of your writing.
I am not sure about the ARM7 bit. however, there is no reason to use the uncached mirror on the arm7 - since it does not have a cache. also, dma and sound buffers expect the normal main memory address not the uncached mirror pointer. so only use the uncached pointer to fill your buffer.
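A sketch of that usage, assuming a build where `ARM9` is defined pulls in the real `memUncached()` through `<nds.h>`. The host stand-in, `scroll_table` and `fill_scroll_table` are invented so the example compiles off-target; the stand-in returns the pointer unchanged because no mirror exists there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifdef ARM9
#include <nds.h> /* provides the real memUncached() */
#else
/* Host stand-in (assumption for off-target builds only). */
static void *memUncached(void *address) { return address; }
#endif

#define TABLE_LINES 192

/* Global, so it lives in main RAM rather than on the DTCM stack. */
static uint16_t scroll_table[TABLE_LINES];

/* Fill the table through the uncached view so no write lingers in the
 * data cache. Returns the normal address, which is the one to hand to
 * the DMA source register. */
const uint16_t *fill_scroll_table(uint16_t base_scroll)
{
    volatile uint16_t *out = (volatile uint16_t *)memUncached(scroll_table);
    assert(out != NULL);
    for (int line = 0; line < TABLE_LINES; line++)
        out[line] = (uint16_t)(base_scroll + line);
    return scroll_table;
}
```

The approach only holds up if every write to the buffer goes through the uncached pointer; mixing in cached writes could leave stale lines that overwrite the table when they are evicted later.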
#176183 - LOst? - Wed May 04, 2011 5:56 pm
elhobbs wrote: |
dma always bypasses the cache. the easiest way is to use: Code: | void *memUncached(void *address);
|
it converts a main memory pointer to its uncached mirror address.
so declare your array as you normally would - either as a global or use malloc/new (so it will be in main memory rather than the stack) then pass that pointer through memUncached and use that result for all of your writing.
I am not sure about the ARM7 bit. however, there is no reason to use the uncached mirror on the arm7 - since it does not have a cache. also, dma and sound buffers expect the normal main memory address not the uncached mirror pointer. so only use the uncached pointer to fill your buffer. |
Thank you! That's a very clear answer :)
EDIT: I am very happy with the HDMA results. However, I had problems with the first scanline as usual, but solved it in a way that I really would like to understand.
When having an unknown number of background offsets updated each scanline, the first scanline must be a dynamic range. That's done on the GBA hardware by using swiCopy() bios function. Now on the GBA it works perfectly, but on the NDS real hardware, the first scanline is completely ignored. I fixed it by calling DC_FlushRange() with that dynamic range right before the swiCopy() call. The result was perfect on all versions of NDS. But I am also using swiCopy() and swiFastCopy() to copy OAM entries, both in the game loop, and in the VBlank, and none of them have problems with the cache. I can flush, or I can ignore the flush, and I get the same result. The big question is, why is that first scanline so sensitive? It has always been sensitive, even on the GBA. I remember spending days back in the GBA era just to try to get the first scanline to appear. I had to DMA it using DMA 0 to make it work then. That works with the NDS too, but I wanted to do it like.... you know.... Nintendo... does it.
EDIT: Maybe because the OAM entries are defined as SpriteEntry struct array, and my scanline buffer is just a u16 array? Is there any aligning needed, and how to do it?
#176185 - Ruben - Thu May 05, 2011 7:04 am
From the 'symptoms' you've described, this is a common problem with HDMA.
As the horizontal blank occurs AFTER a scanline, this means that:
-entry[0] is displayed as the SECOND line (ie, line 1)
-entry[239] is the first line displayed [unless you change the settings in V-Blank]
Either that, or you're spending too much time in V-Blank and eating into H-Blank before you start HDMA.
#176189 - sverx - Thu May 05, 2011 9:38 am
Hblank doesn't take place on VBlank, thus your
- entry[0] is the value that is read on the 1st HBlank (so after 1st line)
- entry[190] is the value that is read on the 191st HBlank (so after the line before the last one of a DS screen)
in my code (for instance when I want to define a circular window) I usually send the entry[0] value(s) right when VBlank happens, and I make HBlank start reading values from entry[1], that's simpler for me :)
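The off-by-one can be checked with a small host-side model of one frame. This is a simplification under assumptions: one register, one table entry per line, and `simulate_frame` is an invented name. The register value set in VBlank is what line 0 is drawn with, and each HBlank loads the value for the following line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINES 192

/* entry:        per-scanline table (LINES values)
 * hdma_start:   index of the first entry the HBlank DMA reads (0 or 1)
 * vblank_value: value written to the register by hand during VBlank
 * shown:        receives the value each line was actually drawn with */
void simulate_frame(const uint16_t *entry, int hdma_start,
                    uint16_t vblank_value, uint16_t *shown)
{
    assert(entry != NULL && shown != NULL);
    assert(hdma_start == 0 || hdma_start == 1);

    uint16_t reg = vblank_value;
    int next = hdma_start;

    for (int line = 0; line < LINES; line++) {
        shown[line] = reg;      /* the line is drawn with the current value */
        if (line + 1 < LINES)   /* HBlank after this line: load the next one */
            reg = entry[next++];
        /* Transfers after the last line cannot affect this frame, so the
         * model skips them. */
    }
}
```

Starting the HDMA at `entry[0]` shifts everything down by one line, while writing `entry[0]` in VBlank and starting the HDMA at `entry[1]` lines the table up with the screen, which matches the arrangement described above.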
#176191 - Ruben - Thu May 05, 2011 10:35 am
Oops, heh. I need more sleep. =P
#176192 - LOst? - Thu May 05, 2011 10:49 am
Yes I know about that issue. When I first began programming for the GBA, I had no idea what was the official way of getting that first scanline to work.
At one point, I thought the Vcount interrupt was the only way, but the game I was trying to do used the Vcount interrupt for something else at the same time.
When I moved on to the NDS, I had no knowledge of the data cache and how it operated. It made things that the GBA could handle fine go wrong.
And this time around, I already had a working HDMA game, but I am trying to construct a special effect (that I have been wanting to do since I started with GBA and NDS programming), and the game that uses these effects also does a lot of other things at the same time, making it a challenge to get things in the right order, not to mention the game is a GBA game, so when I convert its ways, I need to account for the cache.
The order of things is quite interesting. When to do sprites, when to upload sprites, when to set background offsets, when to do HDMA. It is SO interesting that a public tutorial might be done in the future. I am allowed to disassemble code in my country under the current laws. But I will not post it here if it is against gbadev's policy. The way I am doing this is the Nintendo way, but as far as I have heard, there is no known official way, so it is therefore a trade secret.
Please note that the reason I own all the NDS versions up to DSi XL is to program them in the correct way (Nintendo 3DS doesn't have HDMA hardware, so I am not interested in that product. Please don't make a rumor of that info until it is confirmed by future homebrew!), and for every succession, I feel very happy! It is my chance to do games like they did during the SNES time when I was little and dreamed about making games.
The problem is for me to figure out when I need to flush the cache. I think it is a compiler thing this time.
Using the two bios functions:
CpuFastSet
CpuSet
according to http://nocash.emubase.de/gbatek.htm#biosmemorycopy
is what I need to do.
I use them to copy my OAM buffer into 0x7000000 during Vblank, and the result doesn't matter if I am flushing the OAM buffer range with libNDS's DC_FlushRange() or not.
But the first scanline is totally ignored if I am not flushing the range of the scanline buffer with DC_FlushRange().
I will do some simple test and edit this post with the results...
Test 1 is to word-memcpy (using a simple for-loop) the first scanline, but read from the uncached mirror memory location:
*SUCCEEDED* - First scanline shows up the correct frame.
Test 2 is to word-memcpy (using a simple for-loop) the first scanline without Flushing the range:
*FAILED* - First scanline is completely ignored.
Test 3 is to word-memcpy (using a simple for-loop) the first scanline that has been flushed:
*SUCCEEDED* - First scanline shows up the correct frame.
Test 4 is to use swiCopy half words the first scanline without Flushing the range:
*FAILED* - First scanline is completely ignored.
Test 5 is to use swiCopy half words the first scanline, but read from the uncached mirror memory location:
*SUCCEEDED* - First scanline shows up the correct frame.
Test 6 is to use swiCopy half words the first scanline that has been flushed:
*SUCCEEDED* - First scanline shows up the correct frame. This is the one I am currently using.
I have defined my scanline buffer like this (global):
Code: |
u16 ScanBuffer [SCREEN_HEIGHT * 2];
|
I have defined my OAM buffer like this (global):
Code: |
SpriteEntry OAMBuffer [SPRITE_COUNT];
|
Where SpriteEntry is a typedef union defined in sprite.h of libnds, and SPRITE_COUNT is defined in that same header as 128.
#176195 - Ruben - Thu May 05, 2011 12:12 pm
Well, for starters, I wouldn't use the BIOS's memory copy function because...
Quote: |
BUG: The NDS uses the fast 32-byte-block processing only for the first N bytes (not for the first N words), so only the first quarter of the memory block is FAST, the remaining three quarters are SLOWLY copied word-by-word. |
That said...
Yeah, you'll definitely want to play it safe when copying data (especially to I/O), so it's best to flush the cache (or read from uncached memory); otherwise the copy may read older data, which may be the cause of the glitch you see.
Also, unless you're doing costly calculations, you don't need a double buffer. As long as you can complete all calculations and fire the HDMA before V-Blank finishes, you're fine.
#176198 - sverx - Fri May 06, 2011 11:33 am
BTW if you still suspect there could be other problems apart from the cache related ones, you could try disabling the arm9 data cache completely and test that. I know it's possible to do that, but I can't tell you how to do that because I've never done it :|
Check this:
http://nocash.emubase.de/gbatek.htm#armcp15protectionunitpu
#176201 - LOst? - Fri May 06, 2011 10:24 pm
sverx wrote: |
BTW if you still suspect there could be other problems apart from the cache related ones, you could try disabling the arm9 data cache completely and test that. I know it's possible to do that, but I can't tell you how to do that because I've never done it :|
Check this:
http://nocash.emubase.de/gbatek.htm#armcp15protectionunitpu |
Wow, that could actually pinpoint a lot of problems for many people, and I suggest adding "a debug function" to libNDS that turns the cache off completely for the purpose of locating cache-related bugs.