#29058 - ScottLininger - Wed Nov 10, 2004 10:25 pm
I'm working on some custom compression routines for graphics, mostly just to learn more about it. So I've written a tool that converts a graphic into an array of compressed u8's. This array gets decompressed into VRAM by a C function.
My question is: what kind of hoops does the compiler jump through to deal with u8s? Would I be better off writing out an array of u16s and then doing manual masking to handle the decoding? What about u32s? This is enough processing that it is probably worth optimizing if possible.
My assumption is that the compiler would just spit out the same ASM (or probably better) that would be generated by all my manual u16-to-two-u8's conversion work.
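To be concrete, the manual u16-to-two-u8's conversion I mean is something like this (names and the little-endian packing are just my own, not any standard format):

```c
#include <stdint.h>

typedef uint8_t  u8;
typedef uint16_t u16;

/* Decode a u16 array into individual bytes by hand,
   assuming little-endian packing: low byte first. */
static void unpack_u16(const u16 *src, u8 *dst, int n_halfwords)
{
    for (int i = 0; i < n_halfwords; i++) {
        u16 v = src[i];
        dst[2 * i]     = (u8)(v & 0xFF);  /* low byte  */
        dst[2 * i + 1] = (u8)(v >> 8);    /* high byte */
    }
}
```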
-Scott
#29064 - ampz - Wed Nov 10, 2004 10:51 pm
There are native ARM instructions that operate on 8, 16 and 32 bit values.
Reading/writing one 8bit value takes the same amount of time as reading/writing a 16bit value, if the memory bus is 16bits or more.
If moving to 16 or 32 bit values would require lots of masking and shifting, then there is probably not much speed improvement in it for you.
Last edited by ampz on Wed Nov 10, 2004 10:56 pm; edited 1 time in total
#29065 - poslundc - Wed Nov 10, 2004 10:53 pm
GCC will interpret u8s as requiring single-byte addressing. If your algorithm can work comfortably manipulating four bytes at a time without looping then it's probably worth it to use u32s instead.
If you would have to loop while constantly keeping track of how many bytes you have left in your "buffer" of a u32, then it probably isn't worth it. Such schemes can work, but unless you can establish a rule in your encoding like "no zero values permitted", they're more often than not more trouble than they're worth.
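To illustrate the bookkeeping I mean (a hypothetical sketch, not any particular encoder's format):

```c
#include <stdint.h>

typedef uint8_t  u8;
typedef uint32_t u32;

/* Pulling bytes out of a u32 "buffer" while tracking how many
   are left. The refill test and counter maintenance run on
   every single byte -- that's the overhead that can eat the
   gain from the wider reads. */
static u8 next_byte(const u32 **src, u32 *buf, int *left)
{
    if (*left == 0) {          /* refill from the 32-bit stream */
        *buf = *(*src)++;
        *left = 4;
    }
    u8 b = (u8)(*buf & 0xFF);  /* consume the low byte */
    *buf >>= 8;
    (*left)--;
    return b;
}
```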
If this routine is heavy-use/time-critical you may want to consider optimizing it as ARM assembly and putting it in IWRAM.
Dan.
#29068 - ampz - Wed Nov 10, 2004 11:01 pm
If it reads data from the cart, and write data to VRAM, then there is little reason to consider 32bit values, since both the cart and VRAM have 16bit busses.
(one 32bit read from a 16bit bus is automatically converted into two 16bit reads by the memory bus hardware)
#29071 - poslundc - Thu Nov 11, 2004 12:01 am
ampz wrote: |
If it reads data from the cart, and write data to VRAM, then there is little reason to consider 32bit values, since both the cart and VRAM have 16bit busses. |
A 32-bit load from ROM takes advantage of sequential access, which (apparently, according to DekuTree64's recent tests) two sequential 16-bit loads do not.
Having a 16-bit bus also doesn't help because if you're using the u8 datatype then the compiler will still use 8-bit loads, so you are only getting 50% performance from the bus.
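To put that in C terms (a plain copy, just to show the difference in load count):

```c
#include <stdint.h>
#include <stddef.h>

typedef uint8_t  u8;
typedef uint32_t u32;

/* The same copy expressed two ways: byte-wise issues one
   load/store per byte, word-wise one per four bytes (this
   assumes the pointers are 4-byte aligned and the size is a
   multiple of 4). */
static void copy8(u8 *dst, const u8 *src, size_t n)
{
    while (n--) *dst++ = *src++;
}

static void copy32(u32 *dst, const u32 *src, size_t n_words)
{
    while (n_words--) *dst++ = *src++;
}
```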
Dan.
#29089 - ampz - Thu Nov 11, 2004 7:20 am
poslundc wrote: |
ampz wrote: | If it reads data from the cart, and write data to VRAM, then there is little reason to consider 32bit values, since both the cart and VRAM have 16bit busses. |
A 32-bit load from ROM takes advantage of sequential access, which (apparently, according to DekuTree64's recent tests) two sequential 16-bit loads do not. |
I'm aware of that, but the total difference is not that incredible. Especially if the 32bit read requires extra instructions to be executed (potentially from the cart).
#29100 - poslundc - Thu Nov 11, 2004 2:58 pm
OK... how about that the timing for a LDR instruction is 1S + 1N, to which the waitstates are added. If you break a 32-bit load into two LDRH instructions, then in addition to the non-sequential waitstates you'll also pay an extra 1S + 1N.
Dan.
#29104 - Lord Graga - Thu Nov 11, 2004 5:04 pm
But it will take 4 bytes more of ROM! :O
#29110 - Miked0801 - Thu Nov 11, 2004 7:06 pm
I realized a significant speed improvement in our compression routines here when I switched them from u8 to u32 awareness. Hell, I started with something close to PuCrunch, made it 32-bit aware, and behold, it goes something like 10x faster than it used to. Our RLE/LZ hybrid roughly doubled in speed going from u8s to u16s with almost no extra code complexity (though going to u32s would have made it harder to write). Moral: Reading from memory in most (most) compression routines is your biggest bottleneck. A small decrease there yields huge performance gains.
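To give an idea of what "u16 awareness" looks like (a toy sketch with a made-up format, not our actual codec): pack each run as (count << 8) | value, so the inner loop does one 16-bit read per run instead of two 8-bit reads.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint8_t  u8;
typedef uint16_t u16;

/* Toy RLE decoder, 16-bit aware: each control word packs a
   run count in the high byte and the repeated value in the
   low byte. Format is hypothetical; the point is the read
   width on the source stream. */
static size_t rle_decode16(const u16 *src, size_t n_runs, u8 *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < n_runs; i++) {
        u16 run   = src[i];          /* one 16-bit fetch per run */
        u8  count = (u8)(run >> 8);
        u8  value = (u8)(run & 0xFF);
        for (u8 j = 0; j < count; j++)
            dst[out++] = value;
    }
    return out;
}
```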
#29117 - ampz - Thu Nov 11, 2004 9:21 pm
Miked0801 wrote: |
Moral: Reading from memory in most (most) compression routines is your biggest bottleneck. A small decrease there yields huge performance gains. |
Depends a lot on the memory.
There is little gain if you are working with single-cycle memory.
#29130 - Miked0801 - Fri Nov 12, 2004 12:45 am
If you're working with compression, you're most likely reading from ROM, yes? Reading into IWRAM and then decompressing wastes time and memory. EWRAM bottlenecks worse than ROM.
#29215 - Wriggler - Sun Nov 14, 2004 2:00 am
Without trying to sound too dumb (with all this talk of bottleneck jiggery-pokery), is the moral of this thread NOT to use u8's? Is using u32's fastest? I assume video memory should use u16 since it's got a 16-bit bus, but should I use u32 for everything else? Does it *really* make a difference?
Cheers,
Ben
#29218 - sajiimori - Sun Nov 14, 2004 2:17 am
If there's a moral of this thread, it's to not take it for granted that any particular size is going to be faster.
If you're having speed problems, there are lots of guys here that love to optimize things so feel free to share more details. If not, then don't worry about variable sizes -- just use int.
#29223 - ampz - Sun Nov 14, 2004 3:01 pm
Large memory reads/writes are always at least as fast as multiple small reads/writes, but the difference depends highly on the underlying memory bus configuration. If using large reads/writes means you have to do a lot of extra shifting and masking, then there is not always a performance gain.
One way to gain significant performance is to load several words of data into registers using the ldmia instruction, then decompress the data in registers and store the result to VRAM using the stmia instruction.
The ldmia instruction should perform only a single non-sequential cart read, giving you a significant boost in read performance.
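In C the idea looks roughly like this (whether GCC actually emits ldmia/stmia here depends on the compiler version and flags -- for guaranteed ldm/stm you'd write the loop in assembly):

```c
#include <stdint.h>

typedef uint32_t u32;

/* Read four words into locals in one go, then store four in
   one go. Loading into several registers at once is what
   gives the compiler a chance to use a single ldmia (one
   non-sequential access, three sequential) and a single
   stmia for the stores. */
static void copy_16_bytes(u32 *dst, const u32 *src)
{
    u32 a = src[0], b = src[1], c = src[2], d = src[3];
    dst[0] = a; dst[1] = b; dst[2] = c; dst[3] = d;
}
```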
#29224 - poslundc - Sun Nov 14, 2004 4:16 pm
Use the data type that makes most sense. Arrays of data typically have a width that matches the purpose. For example, palette entries are best represented as u16s. If you are copying with DMA or the BIOS copy routines then the data type you specify in C won't have any impact on the performance of it.
Dan.
#29235 - ampz - Sun Nov 14, 2004 7:58 pm
poslundc wrote: |
Use the data type that makes most sense. Arrays of data typically have a width that matches the purpose. For example, palette entries are best represented as u16s. If you are copying with DMA or the BIOS copy routines then the data type you specify in C won't have any impact on the performance of it. |
Well, that's assuming the data is properly aligned. A u8 array might not be aligned to a 16 or 32bit boundary.
I agree that you should normally use the data type that makes most sense.
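With GCC you can force the alignment so a byte array can still safely be read through a u16* or u32* (the array contents here are just filler):

```c
#include <stdint.h>

typedef uint8_t u8;

/* GCC's aligned attribute pins the array to a 4-byte
   boundary, so casting its address to u16* or u32* for wide
   reads is safe. */
static const u8 packed_data[8] __attribute__((aligned(4))) = {
    1, 2, 3, 4, 5, 6, 7, 8
};
```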
#29359 - Miked0801 - Fri Nov 19, 2004 12:57 am
And that, unless you have a really good reason not to, align your data for ease of reading.
#29371 - poslundc - Fri Nov 19, 2004 4:34 am
I tend to assume that GCC will automatically align my arrays for me... at least if they're large enough to merit it (ie. longer than a word); I could easily be taking that for granted, though.
Dan.
#29548 - lgo - Sun Nov 21, 2004 7:48 pm
Btw, what's the deal with declaring local variables inside a function: why does a u16 take more space in the compiled object and ROM than a u32 or u8?
Should I stick to u32 and u8 inside functions ? Are these variables placed in the IWRAM, or does this happen because of the stack or memory alignment ? My memory on the subject is weak.
I'm using gcc, if that matters.
#29552 - tepples - Sun Nov 21, 2004 9:05 pm
Load instructions that reference a 32-bit or an unsigned 8-bit value can use all the memory addressing modes. Load instructions that reference a signed 8-bit value or a 16-bit value can use only a subset of these modes, which means GCC has to generate additional instructions to begin calculating the effective address.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#29556 - poslundc - Sun Nov 21, 2004 10:19 pm
Although this is true, it generally isn't a compelling reason not to use u16s. You need to be doing some pretty critical optimizations before the available addressing modes outweigh the value of using the appropriate word size. Also keep in mind that the width of the data type you specify in C doesn't affect stuff like DMA copying anyway.
Dan.
#29558 - ampz - Sun Nov 21, 2004 10:41 pm
The C99 standard defines a new set of data types. For example, there is a "16-bit int or larger" type, where the compiler is free to use a larger type if it provides higher performance.
#29634 - lgo - Tue Nov 23, 2004 12:40 am
So... in one of my projects (a simple mode 0 sprite demo), the compiled ROM file took 7100 bytes when I just used u32 or s32 as the type of variables almost everywhere.
Then I converted the variable types to better match their contents (like x,y coordinates to s16, colors to u16, etc.), and now the ROM file takes about 7500 bytes. Is this good, as in better speed, or is it just bad? I didn't touch any of the const sprite data or anything that is just copied with DMA.
What are the advantages of using u8's and u16's versus u32's ? I don't know the specs of the ARM7TDMI CPU so well that I could answer this question.
#29639 - sajiimori - Tue Nov 23, 2004 1:18 am
Refer to my last post. The subject has been done to death and there's truly nothing more to be said.