gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback machine copies. A new forum can be found here.

DS development > Parts of screen refuse to update

#71511 - HyperHacker - Mon Feb 13, 2006 7:06 am

This is a problem that has plauged every DS app I've written so far. I set both screens to mode 5 and use BG3 to display graphics; it simply works like a 256x192 16BPP bitmap. Because drawing directly into VRAM produces flicker, I keep a second bitmap in memory, and use dmaCopyWords to copy it into VRAM during VBlank.

The problem is, sometimes certain portions of the screen around the corners simply will not update. If I draw something, then erase it to black, pieces of what I drew are still visible even though reading from the bitmap in memory shows that the pixel value is 0x8000, which should be black. They keep displaying the old image until I draw something else there, which is especially odd, as it should be copying the bitmap into VRAM every VBlank whether I've modified it or not. The parts that don't get updated seem to be random, but have some patterns:
  • Starting from the very top left and very bottom left corners (positions 0,0 and 0,191) there are always at least two pixels in a horizontal line that don't update.
  • The problem always occurrs near the corners of the screen.
  • The non-updated portions are often rectangular, and always wrap around the edge of the screen (starting on the right, and continuing on to the next row on the left, as the bitmap is a one-dimensional array).


I can draw as many different things as I want, and they all show up fine, until I try to clear the screen to black (0x8000) at which point some of the pixels just don't change. This can be seen in my GBA Booter app; switching from any border to None will leave pieces of the border on the screen. What's more is it's not just black. If I try to clear the screen to any colour at all, it will do this, but when I'm actually drawing images, it works fine. Also this never happens if I only clear outside of those areas (as mentioned, it only happens in the corners). The problem isn't in the function that does the clearing either, as I've used several different methods including writing directly to the bitmap.

The problem seems to be that certain writes to the bitmap are being ignored. However I've used all manner of methods to ensure they aren't:
  • Displaying the pixels' colour value on the screen and ensuring that it's correct
  • Delaying DMA until all writes to the bitmap are complete
  • Delaying writes to the bitmap until DMA is complete
  • Adding a while loop to the function that clears the bitmap to keep writing each pixel until its value actually is the value being written


I'm completely stumped as to what the heck is going on here. @_@

Relevant code, all on ARM9:
Setting up video and interrupts:
Code:
   videoSetMode(MODE_5_2D | DISPLAY_BG3_ACTIVE);
   videoSetModeSub(DISPLAY_SCREEN_OFF); //Yes, I'm only using one screen on the DS. Crazy ain't it?
   vramSetMainBanks(VRAM_A_MAIN_BG_0x6000000, VRAM_B_MAIN_BG_0x6000000, VRAM_C_LCD, VRAM_D_LCD); //Put both VRAM banks at the same place, so borders work
   MainScreenBuf = CreateGraphicBuffer(SCREEN_WIDTH,SCREEN_HEIGHT);
   Coords = (8 << 16) | 16;

   while(IPC->arm9desc != A9_GBASCREEN);
   IPC->arm9desc = 0;

   if(IPC->arm9data)
   {
      TopScreen = false;
      lcdMainOnBottom();
      IPC->arm7data = PM_BACKLIGHT_TOP;
   }
   else
   {
      TopScreen = true;
      lcdMainOnTop();
      IPC->arm7data = PM_BACKLIGHT_BOTTOM;
   }

   IPC->arm7desc = A7_POWEROFF;
   while(IPC->arm9desc != A9_ACK);
   IPC->arm9desc = 0;

   BG3_CR = BG_BMP16_256x256;
   BG3_XDX = 1 << 8;
   BG3_XDY = 0;
   BG3_YDX = 0;
   BG3_YDY = 1 << 8;
   BG3_CX = 0;
   BG3_CY = 0;

   SUB_BG3_CR = BG_BMP16_256x256;
   SUB_BG3_XDX = 1 << 8;
   SUB_BG3_XDY = 0;
   SUB_BG3_YDX = 0;
   SUB_BG3_YDY = 1 << 8;
   SUB_BG3_CX = 0;
   SUB_BG3_CY = 0;

   //Init interrupts
   REG_IME = 0; //Disable interrupts while changing them
   IRQ_HANDLER = Interrupt; //Set handler callback
   REG_IE = IRQ_VBLANK | IRQ_HBLANK; //Interrupt on vblank only
   REG_IF = ~0;
   DISP_SR = DISP_VBLANK_IRQ | DISP_HBLANK_IRQ;
   REG_IME = 1; //Enable interrupts

   swiWaitForVBlank();
   swiWaitForVBlank();
   swiWaitForVBlank(); //Let things set up


VBlank handler:
Code:
   else if(REG_IF & IRQ_VBLANK) //VBlank interrupt
   {
      KeysPressed = IPC->buttons_pressed; //Best to keep a local copy, since the ARM7 may modify it
      KeysHeld = IPC->buttons_held;

      if(SwapBuffers)
      {
         //while(MainScreenBuf->InUse);
         //MainScreenBuf->InUse++;
         dmaCopyWords(0,MainScreenBuf->Pixels,BG_GFX,(SCREEN_WIDTH*SCREEN_HEIGHT) << 1);
         //while(dmaBusy(0)) swiDelay(1);
         //MainScreenBuf->InUse--;
         SwapBuffers = false;
      }

      VBLANK_INTR_WAIT_FLAGS |= IRQ_VBLANK; //Signal that vblank interrupt has been processed
      REG_IF |= IRQ_VBLANK; //Signal that vblank interrupt processing is done. We need to trigger a write even though this shouldn't change the value.
   }

(and yes, I am setting SwapBuffers, also tried ignoring it and just updating every frame.)

Graphic functions involved:
Code:
/*
Creates a graphic buffer
CPU: ARM9
Inputs:
   -Width, Height = Size of buffer
Returns: Pointer to buffer or NULL on fail.
Notes:
   -Only 16-bit-colour buffers are currently supported.
   -Be sure to free the buffer when you're done with it. (Just free(Buffer) will do.)
   -Buffer's content is undefined when created. If you need it to start filled with
    a given colour, fill it yourself after creating it.

*/
GraphicBuffer* CreateGraphicBuffer(uint16 Width, uint16 Height)
{
   GraphicBuffer* G = (GraphicBuffer*)memalign(4,sizeof(GraphicBuffer) + (((uint32)Width * (uint32)Height) << 1));
   if(!G) return NULL;
   G->Width = Width;
   G->Height = Height;
   return G;
}


/*
Clears a graphic buffer to a specified colour
CPU: ARM9
Inputs:
   -Buffer: Pointer to graphic buffer
   -Colour: RGB15 colour
*/
void ClearGraphicBuffer(GraphicBuffer* Dest, uint16 Colour)
{
   Colour |= 0x8000;
   int i;
   uint16* P = Dest->Pixels;
   for(i=0;i<(Dest->Width * Dest->Height);i++)
   {
      (*P) = Colour;
      P++;
   }
}


/*
Draws a solid rectangle on a graphic buffer
CPU: ARM9
Inputs:
   -Dest = Destination buffer
   -DX, DY = Destination coords
   -W, H = Dimensions of rectangle
   -Colour = RGB15 colour
*/
void FillRect(GraphicBuffer* Dest, uint16 DX, uint16 DY, uint16 W, uint16 H, uint16 Colour)
{
   Colour |= 0x8000;
   int x,y;
   uint16* DP = Dest->Pixels + DX + (DY * Dest->Width);
   for(y=0;y<H;y++)
   {
      for(x=0;x<W;x++)
         (*(DP + x)) = Colour;
      DP += Dest->Width;
   }
}

#71529 - DekuTree64 - Mon Feb 13, 2006 8:30 am

Sounds like parts of your buffer are still in the cache, and need to be flushed out before DMA will see them. Looks like DC_FlushRange is the function to do that.

It may actually be faster to copy the buffer to VRAM with CPUFastSet (or a custom ldmia/stmia loop) instead, since flushing data from the cache does take time too. I'd be interested to know the results if you try any speed tests on it.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku

#71530 - HyperHacker - Mon Feb 13, 2006 8:40 am

That does seem logical, since I'm doing a lot of reading and writing of the corner pixels (generating a diagonal gradient). What's this DC_FlushRange though?

I'll try a custom loop tomorrow. I need to sleep now.

#71794 - wintermute - Tue Feb 14, 2006 3:43 pm

DC_FlushRange flushes an area of cache to memory to allow DMA and arm7 to read it.

http://devkitpro.sourceforge.net/devkitProWiki/libnds/a00027.html
_________________
devkitPro - professional toolchains at amateur prices
devkitPro IRC support
Personal Blog

#71893 - HyperHacker - Wed Feb 15, 2006 5:06 am

OK, so... I tried to write that loop:
Code:
uint16* UncachedPixels;
[...]
else if(REG_IF & IRQ_VBLANK) //VBlank interrupt
{
   KeysPressed = IPC->buttons_pressed; //Best to keep a local copy, since the ARM7 may modify it
   KeysHeld = IPC->buttons_held;

   if(SwapBuffers)
   {
      UncachedPixels = MainScreenBuf->Pixels + 0x400000;
      //dmaCopyWords(0,MainScreenBuf->Pixels,BG_GFX,(SCREEN_WIDTH*SCREEN_HEIGHT) << 1);
      __asm(".arm\n"
         "ldr r1,UncachedPixels\n"
         "ldr r2,BG_GFX\n"
         "ldr r3,=49152\n" //256x192
         "loop:\n"
         "ldrh r0,[r1, #2]!\n" //Read from r1 and add 2 to it
         "strh r0,[r2, #2]!\n" //Store at r2 and add 2 to it
         "sub r3,r3,#1\n" //r3--
         "bne loop\n"
         ".pool\n"
         );
      SwapBuffers = false;
   }
   [...]

Probably all wrong, but anyway... After a bit of tinkering I got rid of all the errors it complained of... and then it came up with more. (God how I hate that.) What the hex does "Error: internal_relocation (type: OFFSET_IMM) not fixed up" mean? I get it almost every time I try to do ASM. >_<

[edit] Hah, it is a cache problem. Putting DC_FlushRange in fixed it. I still want to get the ASM loop working though, as I imagine this is rather slow (and even if not, I want it as fast as possible).

Strange though, I tried just passing the pointer + 0x400000 to dmaCopyWords(), but it didn't make a difference. (As Chishm told me a while ago, 0x02400000-0x02800000 is an uncached mirror of 0x02000000-0x02400000.)

#71898 - chishm - Wed Feb 15, 2006 5:56 am

HyperHacker wrote:
Strange though, I tried just passing the pointer + 0x400000 to dmaCopyWords(), but it didn't make a difference. (As Chishm told me a while ago, 0x02400000-0x02800000 is an uncached mirror of 0x02000000-0x02400000.)

Uncached from the CPU's perspective. Reading / writing the uncached memory range from the CPU will always bypass the cache, but using the cached region may put it into dcache. DMA can only access the main memory, so to it, the cache doesn't even exist and it is as though both ranges are uncached. However, because it can't access the cache, any changes not flushed will not be seen by the DMA copy.
_________________
http://chishm.drunkencoders.com
http://dldi.drunkencoders.com

#71900 - DekuTree64 - Wed Feb 15, 2006 6:19 am

HyperHacker wrote:
What the hex does "Error: internal_relocation (type: OFFSET_IMM) not fixed up" mean?

Usually means undefined identifier or syntax error of some sort, which in this case is probably BG_GFX. You'll need to use the straight address without any C-style casts around it.

Quote:
Hah, it is a cache problem. Putting DC_FlushRange in fixed it.

Yay, that was just a first guess based on the symptoms :)

Quote:
I still want to get the ASM loop working though, as I imagine this is rather slow.

Yup, ldrh/strh is a major waste of time. At least use ldr/str, but ldmia/stmia is best. I'd recommend doing the copy loop in a .s file, rather than messing with that icky looking inline asm. Try compiling this in a .s for starters:
Code:
.global FastCopy
.arm
.align 2
FastCopy:
stmfd sp!, {r4-r11, lr}   @ Save regs on stack (r0-r3,r12 don't need to be preserved)

@ Do stuff here

ldmfd sp!, {r4-r11, lr}   @ Restore regs
bx lr // Return to caller

And extern it in a .cpp file like
Code:
extern "C" void FastCopy(const void *src, void *dest, u32 bytes);

Or nix the "C" if it's a .c file. Those 3 parameters will be in r0-r2, and you're free to modify them however you like.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku

#71902 - HyperHacker - Wed Feb 15, 2006 6:44 am

Haha, you're right again. I thought BG_GFX was just a constant, but it's actually a macro with a cast. No wonder. :-p

Now I haven't really worked with ARM before... I understand those ops save and restore r4 through r11 and lr (r13?) at sp (r14?) and then update sp, correct? So I would want to then do something like this:
Code:
stmfd sp!,{r4-r11,lr}
loop:
ldmia r0!,{r3-r14}
stmia r1!,{r3-r14}
sub r2,r2,#1
bne loop
ldmfd sp!,{r4-r11,lr}
bx lr

Is that right? I'm gonna try it now...

Argh. I hacked it up a bit to fit in where I had it inline before (gonna move it to a .s file and make it an actual function once it works):
Code:
__asm(".arm\n"
"ldr r0,UncachedPixels\n"
"ldr r1,=0x6000000\n"
"ldr r2,=49152\n"
"stmfd sp!,{r4-r11,lr}\n"
"loop:\n"
"ldmia r0!,{r3-r13}\n"
"stmia r1!,{r3-r13}\n"
"sub r2,r2,#10\n"
"bmi loop\n"
"ldmfd sp!,{r4-r11,lr}\n"
".pool\n"
);

Still getting that internal_relocation error. >_<

[edit] Some quick commenting out of code revealed that the first line is the problem. Without that it assembles fine (but of course doesn't do much). Apparently I'm not going about accessing UncachedPixels correctly; how should I be doing this? (Even though I plan to turn this into a generic FastCopy function later, I'd still love to know.)

[another edit] Ah, I've tried pretty much every symbol on the keyboard in front of 'UncachedPixels' to no avail. Best I got was "address expected" using @. Google turns up two methods: register uint16* UncachedPixels asm("r0") which should put it into r0 automatically, but it doesn't seem to work (the program runs but no display on the screen), and uint16* UncachedPixels external name 'UncachedPixels' which it just rejects right off the bat. (Looking at the page I'm not even sure that was for C.) Also a lot of pages mentioning to use _ or % or $, which don't work, and even more telling me how to use floating-point registers.
Also I fixed a bit of the code up there.

#71909 - gladius - Wed Feb 15, 2006 8:18 am

ldr r0,variableName gets converted into ldr,[pc + offset]. If the variable is out of the range of the pc (offset has a limited range) then that instruction won't work. You can do something like this to fix that problem:

ldr r0,=variableName
ldr r0,[r0]

However, you also need to make sure a pool section is defined close enough to the ldr as well. Inline assembly is a pita. Far easier to make it a function and let the compiler sort out putting the right variable in r0 at the right time, then you can just access it from the file like so:

In the C file:

void myFunc(u32 blah);
...
myFunc(UncachedPixels);

In the Asm file:
.GLOBAL myFunc
myFunc:
@ now r0 contains blah and I may do what I please :)
bx lr

#71911 - HyperHacker - Wed Feb 15, 2006 8:34 am

Ooh, this is a good one.
gladius wrote:
ldr r0,=UncachedPixels
ldr r0,[r0]

Gets an 'undefined reference to UncachedPixels' error on line 888, when the only references to it are on 876 and 872. Again no amount of different symbols are able to get it recognized. Although the variable is declared right before the ASM, so this shouldn't make much difference. Using the DMA method I printed the address of UncachedPixels, and tried just hardcoding it into the ASM to see if even that would work... it went into an endless loop, which it normally doesn't. (Usually it just runs but nothing shows on the screen.) The only thing I noticed was that it should be subtracting 44 rather than 10, since the registers are 32-bit and I use 11 of them, but that didn't help.

How exactly would I include an ASM source file anyway? And out of curiosity, are ARM7 and ARM9 interchangable instruction-wise? (That is, could the same code run on both, assuming it didn't depend on things only available to one?)

#71916 - DekuTree64 - Wed Feb 15, 2006 9:11 am

Gladius' method only works for globals, so if UncachedPixels is a local variable, you have to specify it with funky inline asm-ness:
Code:
asm(
"mov r0, %0\n"
"ldr r1, =0x6000000\n"
// ... rest of the code
: // Outputs
: "r" (UncachedPixels) // Inputs
: ); // Registers corrupted

Ick.

Building .s files is much like building .c files. Just call gcc and it will figure it out by the extension.

ARM7 and ARM9 are for the most part code-compatible, although ARM9 does have a few extra instructions (blx, clz, DSP instructions), and can do ARM/THUMB state switch on loads to the pc, rather than having to load to a temp and then bx. Not sure if gcc uses any of that though.

Also- don't load to r13, that's the stack pointer. You can do it if you save it to a global or somewhere pc-relative first, but it's still dangerous since some interrupt handlers switch to the user stack, and interrupts can happen at any point unless you disable them.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku

#72015 - HyperHacker - Thu Feb 16, 2006 12:21 am

Yech, that is a mess. When I assemble the .s file with gcc, how do I include it? Right now I compile arm7.c and arm9.c (which include whatever files they need) into arm7.o and arm9.o, compile those into arm7.elf and arm9.elf, use objcopy to turn them into arm7.bin and arm9.bin, and link them together with ndstool. Where would the .s file fit in here? I imagine gcc would give me a .o output, and then I objcopy it somewhere?

[edit] Woot, I got it working inline.
Code:
__asm(".arm\n"
   "ldr r0,=MainScreenBuf\n" //We can't say MainScreenBuf->Pixels
   "ldr r1,=0x400004\n" //so add 4 to get to Pixels and 0x400000 to get to uncached RAM
   "ldr r0,[r0]\n"
   "add r0,r0,r1\n"
   "ldr r1,=0x6000000\n" //destination
   "ldr r2,=98304\n" //length
   "stmfd sp!,{r3-r12,lr}\n"
   "loop:\n"
   "ldmia r0!,{r3-r12,lr}\n"
   "stmia r1!,{r3-r12,lr}\n"
   "subs r2,r2,#44\n" //copy 44 bytes at a time (!), and we need those flags updated
   "bpl loop\n"
   "ldmfd sp!,{r3-r12,lr}\n"
   ".pool\n"
);
I know it's probably over-copying (in fact I'm seeing some bugs likely caused by memory corruption), but hey, it works. ;-) I'll fix it up later, once it's in its own function. I figured out the trick to getting at those pixels - MainScreenBuf is global, and although I can't do "=MainScreenBuf->Pixels", I can do "=MainScreenBuf" and add 4 (the number of bytes before Pixels) plus 0x400000 (for uncached memory). Hackish, but like I said I'll fix it up later.
Only thing is it's still having that cache problem... I guess DC_FlushRange is the only way to fix it?

#72035 - wintermute - Thu Feb 16, 2006 2:16 am

Using the uncached region of memory by adding 0x400000 is utterly pointless if you're just going to read it. In order to avoid cache problems you need to use the uncached region for both reading and writing or make use of the DC_Flush and/or DC_Invalidate functions.

The function you've shown here would have worked had you not tried to copy from uncached memory.
_________________
devkitPro - professional toolchains at amateur prices
devkitPro IRC support
Personal Blog

#72074 - HyperHacker - Thu Feb 16, 2006 6:25 am

Ah, so that's it... Simple enough, just a few tweaks to my graphics code. Thanks!

#74755 - HyperHacker - Tue Mar 07, 2006 11:44 am

Yargh... Writes to uncached memory are just being completely ignored. :-(

[edit] Did some more testing. No matter what I write, reading returns some strange value. What I have is a pixel-plotting function, drawing to a buffer in memory, and then the buffer is copied to VRAM during vblank. What I do is draw a pixel in colour 0x1234, then read back its colour and display it. The pixel drawing code sets the high bit, so it should come back as 0x9234. Instead, if I write to uncached memory, then read from cached, I get 0. What's really weird is if I read from uncached, and set the high bit, I get 0xF114... but if I don't set it, I get the address I was writing to! Which is even weirder because it's a full 32-bit address, when the function that reads only returns uint16! ?_?

Maybe there's some sort of pointer problem in my code?
Code:
/*
Draws a pixel on a graphic buffer
CPU: ARM9
Inputs:
   -Buffer: Pointer to graphic buffer
   -X, Y: Position to draw at
   -Colour: RGB15 colour
*/
void DrawPixel(GraphicBuffer* Dest, uint16 X, uint16 Y, uint16 Colour)
{
   if((X >= Dest->Width) || (Y >= Dest->Height)) return; //Don't draw off-screen
   uint16* B = (Dest->Pixels + 0x400000) + X + (Y * Dest->Width);
   (*B) = Colour | 0x8000;
}


/*
Returns the colour value of a pixel on a graphic buffer
CPU: ARM9
Inputs:
   -Buffer: Pointer to graphic buffer
   -X, Y: Position of pixel
Returns: RGB15 value of colour with high bit set, or 0 if out of bounds
*/
uint16 GetPixel(GraphicBuffer* Dest, uint16 X, uint16 Y)
{
   if((X >= Dest->Width) || (Y >= Dest->Height)) return 0;
   uint16* B = (Dest->Pixels + 0x400000) + X + (Y * Dest->Width);
   return (*B); //| 0x8000;
}


Also, the code which copies this data to VRAM must be getting different results (same whether it reads from cached or uncached), as anything written to uncached memory doesn't show up at all, meaning it's reading zero (or at least some low value).

[another edit] Interesting results here.
uint16 x = 0x69;
uint16* xp = &x + 0x400000;
printxy2(MainScreenBuf,0,(Coords >> 16),(Coords & 0xFFFF),"%08X %08X %08X",x,(*xp),xp);

If I omit the " + 0x400000" on the second line, the output is the value of x twice, then its address, as you would expect. With it included, the output is the value of x once, and its address twice! O_o