gbadev.org forum archive

I'm using mode 4 and all my images consist of only 16 colors (= 4 bits). In order to save EWRAM, I've written a program which "compresses" the image data: Even pixels are stored in the upper 4 bits and odd pixels are stored in the lower 4 bits of a byte.

Code:

bits 7 6 5 4 3 2 1 0
0 1 0 1 0 0 0 1 => left pixel color palett address = 0x5,
\ / \ / right pixel color palett address = 0x1
left right
pixel pixel

But there is a tradeoff: If I want to copy the pixels from EWRAM to VRAM, the image has to be "decompressed" using bit swifting operations. Another option would be to store all my images in ROM but without the compression. Which variant is the faster one?

Code:

void drawImage (u8 x, u8 y, const u8* image, u16 width, u16 height, Bool hFlip, u8 transparentColor)
{
width /= 2;
x /= 2;

u8 w, h;
u8 leftColor, rightColor;
if (hFlip == FALSE)
   {
   if (transparentColor == NO_TRANSPARENCY)
   {
      for (h=0; h<height; h++)
         for (w=0; w<width; w++)
         {
            leftColor = image[w+h*width] & 0x0F;
            rightColor = image[w+h*width] / 0x10;
            videoBuffer[(x+w)+120*(y+h)] = rightColor | 0x100*leftColor;
         }
   }
   else
   {
      for (h=0; h<height; h++)
         for (w=0; w<width; w++)
         {
            leftColor = image[w+h*width] & 0x0F;
            if (leftColor == transparentColor)
               leftColor = (0xF00 & videoBuffer[(x+w)+120*(y+h)]) >> 8;

            rightColor = image[w+h*width] / 0x10;
            if (rightColor == transparentColor)
               rightColor = videoBuffer[(x+w)+120*(y+h)] & 0x000F;

            videoBuffer[(x+w)+120*(y+h)] = rightColor | 0x100*leftColor;
         }
   }
}
else
{
   if (transparentColor == NO_TRANSPARENCY)
   {
      for (h=0; h<height; h++)
         for (w=0; w<width; w++)
         {
            leftColor = image[w+h*width] / 0x10;
            rightColor = image[w+h*width] & 0x0F;
            videoBuffer[(x+(width-1-w))+120*(y+h)] = rightColor | 0x100*leftColor;
         }
   }
   else
   {
      for (h=0; h<height; h++)
         for (w=0; w<width; w++)
         {
            leftColor = image[w+h*width] / 0x10;
            if (leftColor == transparentColor)
               leftColor = (0xF00 & videoBuffer[(x+(width-1-w))+120*(y+h)]) >> 8;

            rightColor = image[w+h*width] & 0x0F;
            if (rightColor == transparentColor)
               rightColor = videoBuffer[(x+(width-1-w))+120*(y+h)] & 0x000F;

            videoBuffer[(x+(width-1-w))+120*(y+h)] = rightColor | 0x100*leftColor;
         }
   }
}
}

I would suggest 16 color using sprites/tiles if it's not really necessary to use a bitmap mode...
Other than that, use ints for variables where you don't absolutely need char or short.
_________________
I probably suck, my not is a programmer.

Quote:

I would suggest 16 color using sprites/tiles if it's not really necessary to use a bitmap mode...

I would use one of the tile modes if I could, but the concept of my game doesn't allow me to do so.

Quote:

Other than that, use ints for variables where you don't absolutely need char or short.

Well... maybe I should really store all my images in ROM and use one int per pixel. :|

Kensai wrote:

Flubba wrote:

Other than that, use ints for variables where you don't absolutely need char or short.

... use one int per pixel. :|

What Flubba was referring to here is to use ints for local variables and function parameters, not for the images themselves.

Simply put, there are two kinds of variables: storage variables and worker variables. Storage variables are the things that take up space: globals, arrays, code, structs and such. These take up memory, and smaller will usually be faster. Workers are the things that the CPU actually does its operations on: local variables, including function parameters and temporaries. These do not take up memory, but are kept in the CPU's registers. The CPU tends to work faster when the worker vars are the same size as the registers, and this usually correlates to the 'int' datatype. In this case, smaller is usually slightly slower.

For example, pretty much all local variables and parameters are non-int-sized in your code. In extreme cases, I have seen this half the speed of an algorithm. That's probably not the case here, but it'd still be better to use 'int' or 'u32' for the locals.

Another thing that can help is to use local pointers that point as directly as possible to the memory you're working on. For example, instead of `videoBuffer[(x+w)+120*(y+h)]', create a local pointer to videoBuffer[x+120*y] outside the loop and index through memory using that pointer instead. This cuts down on the size of the statements inside the loop and can, on occasion, speed up the code by quite a bit.

The fastest method would be the one where you can copy the whole thing without any pixel manipulation at all (i.e., non-compressed). But since transparency requires you to consider pixels individually anyway, the extra compression won't make that big of a difference. It may even be faster because you only need to retrieve half the memory.

Oh, also: the left pixel in a VRAM halfword is actually the lower byte, not the upper.

Perhaps consider a different compression scheme? If your data is only at most 4 bits, a huffman compressor will get you better compression and decompression speed than what you have now. Or even RLE or LZ. I'm hesitant to optimize your decompressor when I'm pretty sure your choice of algorithm is a much bigger bottleneck than the code itself.

Aye! You are doing manual transparent color merging? That will be so slow! Sprites do this kind of thing for you in hardware. Failing that, using Mode 0 and doing the conversion math from pixel to 4-bit char space will be much better than this. You are not effectively using the GBA (DS?) hardware given to you. With any of the char modes, you'd be in much better shape.

But, if forced to use Mode 4, think large Look Up Table for speed. Perhaps reading the data 16-bits at a time if alignment allows:

Code:

u8 LUT[256][256];

for (h=0; h<height; h++)
for (w=0; w<width/2; w++)
{
int pixleColor = image[w+h*width];
*videoPtr = LUT[pixleColor];
}
}

The reading of tranparent data is still your bane though.

This will roughly triple the speed of writes without tranparency and the same idea could be applied to tranparency as well with work.

Kensai wrote:

I'm using mode 4 and all my images consist of only 16 colors (= 4 bits). In order to save EWRAM, I've written a program which "compresses" the image data: Even pixels are stored in the upper 4 bits and odd pixels are stored in the lower 4 bits of a byte.

Code:

bits 7 6 5 4 3 2 1 0
0 1 0 1 0 0 0 1 => left pixel color palett address = 0x5,
\ / \ / right pixel color palett address = 0x1
left right
pixel pixel

But there is a tradeoff: If I want to copy the pixels from EWRAM to VRAM, the image has to be "decompressed" using bit swifting operations. Another option would be to store all my images in ROM but without the compression. Which variant is the faster one?

It might be even faster (or perhaps "fast enough") and smaller just to use the BIOS' LZW routines on the native data.

Quote:

Code:

for (h=0; h<height; h++)
for (w=0; w<width; w++)
{
   leftColor = image[w+h*width] & 0x0F;
   rightColor = image[w+h*width] / 0x10;
   videoBuffer[(x+w)+120*(y+h)] = rightColor | 0x100*leftColor;
}
}

This here can be done quite a bit faster. First, since all pixels in an x-span is stored linearly both in the source and destination, you can move all multiplies out of the inner loop. Something like this:

Code:

for (h=0; h<height; h++)
u8 *dst = videoBuffer + x + 120 * (y + h);
for (w=0; w<width; w++)
{
int tmp = *image++;
*dst++ = (tmp & 0x0F) | ((tmp & 0xF0) << 4);
}
}

I swapped the high and low nibble here, because it became slightly faster (the low nibble stays in the same place). You can do further such optimizations if you unroll the inner loop one time, and think about the problem a bit. Unrolling should gain you a lot here as well anyway.

Thank you so much for your advice! I will use a 16x16 color LUT for images without transparency and an optimized version of my algorithm for images with transparency. Additionally I will minimize the usage of the last-mentioned algorithm by splitting up all images that contain transparency into smaller images with and without transparency.

gbadev.org forum archive

Coding > Image compression

#172314 - Kensai - Sat Jan 30, 2010 11:05 am

#172316 - FluBBa - Sat Jan 30, 2010 12:16 pm

#172317 - Kensai - Sat Jan 30, 2010 1:45 pm

#172323 - Cearn - Sat Jan 30, 2010 8:57 pm

#172339 - Miked0801 - Mon Feb 01, 2010 7:31 pm

#172347 - kusma - Tue Feb 02, 2010 11:29 am

#172353 - Kensai - Tue Feb 02, 2010 4:03 pm