#162692 - spinal_cord - Tue Sep 09, 2008 1:55 pm
Just a quick noobish question, is there a way to stop color 0 from being transparent?
Failing that, is there a super-hyper-mega fast way of doing this? -
Code: |
int temp=0;
for(temp=0; temp<512*282; temp++)
{
if(bufmem[temp]==0)bufmem[temp]=16;
}
|
_________________
I'm not a boring person, it's just that boring things keep happening to me.
Homepage
#162695 - Maxxie - Tue Sep 09, 2008 4:03 pm
spinal_cord wrote: |
Just a quick noobish question, is there a way to stop color 0 from being transparent?
|
By disabling the background color and the lower backgrounds. If there is nothing the NDS can put into the transparent areas, it leaves them as they are (with color).
Quote: |
Failing that, is there a super-hyper-mega fast way of doing this? -
Code: |
int temp=0;
for(temp=0; temp<512*282; temp++)
{
if(bufmem[temp]==0)bufmem[temp]=16;
}
|
|
If you are doing this in a buffer in memory with 16-bit or wider access, you could read a full (half)word, check each byte, and write a full (half)word back. This saves memory accesses, because you use all the information from one read instead of throwing parts of it away. It reduces the number of loop iterations but increases the arithmetic per iteration.
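A minimal sketch of the word-at-a-time idea, assuming 8-bit pixels, a 4-byte-aligned buffer, and a length that is a multiple of 4. The function name `fix_zero_bytes` is made up for illustration and is not from the Frodo source.

```c
#include <stdint.h>
#include <stddef.h>

/* Replace every zero byte with 16, using one 32-bit read and one
   32-bit write per four pixels. Assumes buf is 4-byte aligned and
   len_bytes is a multiple of 4 (512*282 is). */
static void fix_zero_bytes(uint32_t *buf, size_t len_bytes)
{
    for (size_t i = 0; i < len_bytes / 4; i++) {
        uint32_t w = buf[i];
        if ((w & 0x000000FFu) == 0) w |= 0x00000010u;
        if ((w & 0x0000FF00u) == 0) w |= 0x00001000u;
        if ((w & 0x00FF0000u) == 0) w |= 0x00100000u;
        if ((w & 0xFF000000u) == 0) w |= 0x10000000u;
        buf[i] = w;
    }
}
```

Because the tests work on the word value, the result does not depend on byte order. Whether this beats the byte loop depends on where the buffer lives and on what the compiler does with the four tests.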
_________________
Trying to bring more detail into understanding the wireless hardware
#162697 - Miked0801 - Tue Sep 09, 2008 6:00 pm
Hyper fast suggestions:
Code: |
u32 *tempPtr = (u32 *)bufmem;
for(int i=0; i<512*282/2; i++)
{
u32 foo = *tempPtr;
if((foo & 0xFFFF) == 0) foo |= 0x10;
if((foo >> 16) == 0) foo |= 0x100000;
*tempPtr++ = foo; // write back and advance to the next 32-bit word
}
|
This will be a little faster than just doing the 16-bit test you are already doing. Taking it further though:
Code: |
stmfd sp!,{r4-r11,lr}
mov lr,#512*282/2/12
ldr r0,tempBuf
loop1:
ldmia r0,{r1-r12}
@ 0xFFFF and 0xFFFF0000 are not encodable as ARM immediates; real code needs the masks in registers or shift-based tests
tst r1,#0xFFFF
orreq r1,r1,#0x10
tst r1,#0xFFFF0000
orreq r1,r1,#0x100000
tst r2,#0xFFFF
orreq r2,r2,#0x10
tst r2,#0xFFFF0000
orreq r2,r2,#0x100000
...
tst r12,#0xFFFF
orreq r12,r12,#0x10
tst r12,#0xFFFF0000
orreq r12,r12,#0x100000
stmia r0!,{r1-r12}
subs lr,lr,#1
bne loop1
ldmfd sp!,{r4-r11,lr}
|
I'm at home so the op-codes may be a bit off, but the idea is correct. This will get the access down to roughly 4.1ish cycles per half-word. The method you have is 8 cycles on a non-write and 9 cycles on a write.
Most of the savings is from the loop unroll. Unrolling yours 8-12 times and doing just half-words (ldrh, cmp 0, strneh) would be 5 cycles on a write and 6 on a pass with about .1 cycles added for overhead, so the unrolled 32-bit method is a bit faster.
I'll look at this more later.
#162698 - Maxxie - Tue Sep 09, 2008 6:36 pm
Miked0801 wrote: |
Hyper fast suggestions:
This will be a little faster than just doing the 16-bit test you are already doing. Taking it further though:
|
I'd think that if he uses paletted colors (not map entries, as the resolution is too high), he'd use 8-bit color per pixel.
Nonetheless, the effect is the same (even greater).
#162701 - Miked0801 - Tue Sep 09, 2008 7:32 pm
Yep - Here's the breakdown for that
ldrb 3 ; bufmem[temp]
cmp 1 ; == 0
orne 1 ; if, assign 16
strneb 1 or 2 - probably around 1.1 ; if, assign 16
adds 1 ; temp++
cmp 1 ; < 512*282
bne 3 ; next loop
--------
11 - 12 cycles per color (11.1 I'd guess meaning 1 in 10 bytes is '0')
to
ldm 0.3125 cycles ; 15 cycles for 48 bytes
tst 1
orrne 1
stm 0.2917 cycles ; 14 cycles once every 48 bytes
sub 0.021 ; 1 cycle every 48 bytes
bne 0.0625 ; 3 cycles every 48 bytes
--------------
2.6875 cycles per color
so a little more than 4x faster than original code.
If assembly is daunting, just changing the loop to count backwards to 0 SHOULD save 1 cycle per iteration, but many compilers are too stupid to catch this on their own. Also, unrolling, say, 8 times would quickly save another couple of cycles.
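For the C-only route, a rough sketch of a loop that counts down to zero and is unrolled 8 times might look like this. It assumes 8-bit pixels and a count that is a multiple of 8; the name `fix_zero_unrolled` is hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Replace zero pixels with 16. The counter runs down to 0 so the loop
   test can reuse the flags from the decrement, and the body is unrolled
   8 times to spread the loop overhead over more pixels.
   Assumes count is a multiple of 8 (512*282 is). */
static void fix_zero_unrolled(uint8_t *p, size_t count)
{
    for (size_t n = count / 8; n != 0; n--) {
        if (p[0] == 0) p[0] = 16;
        if (p[1] == 0) p[1] = 16;
        if (p[2] == 0) p[2] = 16;
        if (p[3] == 0) p[3] = 16;
        if (p[4] == 0) p[4] = 16;
        if (p[5] == 0) p[5] = 16;
        if (p[6] == 0) p[6] = 16;
        if (p[7] == 0) p[7] = 16;
        p += 8;
    }
}
```

Whether the compiler actually emits the cheaper loop is worth checking in the generated assembly.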
#162704 - TwentySeven - Tue Sep 09, 2008 10:10 pm
Impressive.
It's even faster if you fix the problem at the art pipeline stage.
#162707 - Maxxie - Wed Sep 10, 2008 5:23 am
One would think that this solution only works if you actually have the data earlier than runtime.
There is no hint that lets us assume that.
#162708 - TwentySeven - Wed Sep 10, 2008 6:38 am
When you hear hoofbeats, think Horses, not zebras :)
I'll bet you a cookie it's not run time generated image data.
#162709 - eKid - Wed Sep 10, 2008 7:52 am
Code: |
.macro cvt_wd src, dest, base
bic \dest, \base, \src, lsl#4 // 1
bic \dest, \src, lsl#3 // +1
bic \dest, \src, lsl#2 // +1
bic \dest, \src, lsl#1 // +1
orr \dest, \src // +1
.endm // [5]
//------------------------------------------------------------
convert_data: // convert_data( u32* data )
//------------------------------------------------------------
stmfd sp!, {r4-r6,lr} // 4 save regs
ldr r6,=0x10101010 // 2 load base number
mov r3, #512*282/4/16 // 1 fixed length
ldr r2,=tempBuff // 2 get address
loop: // [repeat 2256 times]
//-------------------------------------------------------------
.rept 8 // [repeat 8 times]
ldrd r0, [r2] // 2
cvt_wd r0,r4,r6 // 5
cvt_wd r1,r5,r6 // 5
strd r4, [r2], #8 // 2 (two words stored, so advance by 8 bytes)
.endr // [14 cycles]
subs r3, #1 // 1
bne loop // 3
//-------------------------------------------------------------
ldmfd sp!, {r4-r6, pc} // 7
// [261708 cycles (1.81/entry)]
.pool
|
this is fun :P
Assumes that input is 4-bit values stored in 8-bit entries.
The 1.81 cycles/entry is assuming all memory access time (32-bit) is 1 cycle, so it should probably slow down a bit for cache misses, and slow down a lot if the memory is in uncached VRAM...
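For readers who don't follow ARM assembly, here is a C rendering of what the `cvt_wd` macro computes on one 32-bit word, as far as I can tell from the listing. The function name `cvt_word` is made up. Under the stated assumption (4-bit values stored in 8-bit entries) a byte keeps its bit 4 from the base value only when all four low bits are clear, so only zero bytes become 16.

```c
#include <stdint.h>

/* Branch-free "zero -> 16" on four packed bytes.
   Start with bit 4 set in every byte, then clear it wherever any of
   the low four bits of that byte is set, and finally merge the source. */
static uint32_t cvt_word(uint32_t src)
{
    uint32_t dest = 0x10101010u;
    dest &= ~(src << 4);   /* bit 0 of each byte knocks out bit 4 */
    dest &= ~(src << 3);   /* bit 1 */
    dest &= ~(src << 2);   /* bit 2 */
    dest &= ~(src << 1);   /* bit 3 */
    return dest | src;
}
```

With full 8-bit values the trick only inspects the low nibble, so a byte such as 0x20 comes out as 0x30.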
#162710 - spinal_cord - Wed Sep 10, 2008 11:09 am
You guys seem to have had fun with this. What it is, is the 8-bit image buffer from frodods. I'm using 2 layers resized at different offsets and 50% alpha; the problem is that colour 0 is transparent and can't be alpha'd with anything.
I'm still trying to track down where the buffer is being created in the first place, but I'm not a super-leet coder like a lot of you guys, so much of it just looks like a complete mess to me.
As for assuming 4-bit numbers, I should have made that clearer: it's all 8-bit pixel values, and apparently the palette is repeated 16 times.
#162712 - tepples - Wed Sep 10, 2008 11:51 am
spinal_cord wrote: |
What it is, is the 8bit image buffer from frodods |
If you know that your source uses colors 0-15, as a C64 emulator might, you could just OR all 32-bit words in the buffer with 0xF0F0F0F0 so that it uses colors 240-255. But once you get this port feature complete and mostly tested, you can announce that you have made "the first working 64 emulator for DS". (AVGN could tell you the ambiguity is intentional.)
spinal_cord wrote: |
im using 2 layers resized at different offsets and 50% alpha |
So you're doing α-Lerp (alpha blending of a layer against itself, possibly combined with flicker scaling, to approximate linear interpolation), as seen in my demo and then in nesDS.
spinal_cord wrote: |
the problem is that colour 0 is transparent and cant be alpha'd with anything. |
Blend has two layers: "source" (top) and "destination" (second-from-top). Try setting BLEND_CR to blend the "backdrop" (color 0 areas) as a "destination" layer (BLEND_DST_BACKDROP). That worked for my prototype implementation of α-Lerp.
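As a rough sketch of the register values this implies, the snippet below builds the blend control and coefficient words from the bit layout documented in GBATEK (first targets in bits 0-5, effect mode in bits 6-7, second targets in bits 8-13; EVA in bits 0-4 and EVB in bits 8-12). The `SKETCH_` names are invented so the snippet stands alone, and BG2 over BG3 is an assumption, since the thread does not say which layers are in use.

```c
#include <stdint.h>

/* Hypothetical names; bit positions follow the GBATEK blend control layout. */
enum {
    SKETCH_SRC_BG2      = 1 << 2,   /* BG2 as first target (top)        */
    SKETCH_SRC_BG3      = 1 << 3,   /* BG3 as first target too          */
    SKETCH_MODE_ALPHA   = 1 << 6,   /* effect mode 1 = alpha blending   */
    SKETCH_DST_BG3      = 1 << 11,  /* BG3 as second target             */
    SKETCH_DST_BACKDROP = 1 << 13   /* backdrop as second target        */
};

/* Value for the blend control register: both layers may blend,
   and the backdrop is accepted as a destination. */
static uint16_t blend_control_value(void)
{
    return (uint16_t)(SKETCH_SRC_BG2 | SKETCH_SRC_BG3 | SKETCH_MODE_ALPHA |
                      SKETCH_DST_BG3 | SKETCH_DST_BACKDROP);
}

/* Value for the coefficient register; eva and evb are in 1/16 steps. */
static uint16_t blend_coefficients(unsigned eva, unsigned evb)
{
    return (uint16_t)((eva & 31u) | ((evb & 31u) << 8));
}
```

On real hardware these values would be written to BLEND_CR and the matching coefficient register; for 50% alpha the coefficients would be 8 and 8.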
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#162713 - spinal_cord - Wed Sep 10, 2008 1:44 pm
But wouldn't I lose the other layer if I blend to the backdrop? I went for this option because I couldn't get the flickering working properly.
#162718 - tepples - Wed Sep 10, 2008 7:37 pm
If you have layer 1 over layer 2, you get (layer 1 * 8 + layer 2 * 8)/16, which is what you currently have (and want). But if you blend with the background as a DST, you can get (layer 1 * 8 + backdrop * 8)/16 and (layer 2 * 8 + backdrop * 8)/16, which is also what you want.
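The arithmetic above can be sketched per 5-bit colour channel as follows. This is only an illustration of the formula in this post with a clamp to 31 added; the exact rounding the hardware uses is not something the thread states, and `blend_channel` is a made-up name.

```c
/* Blend one 5-bit channel (0..31) of the top and bottom pixels.
   eva and evb are the weights in sixteenths, e.g. 8 and 8 for 50/50. */
static unsigned blend_channel(unsigned top, unsigned bottom,
                              unsigned eva, unsigned evb)
{
    unsigned v = (top * eva + bottom * evb) / 16;
    return v > 31 ? 31 : v;   /* a channel cannot exceed 31 */
}
```

With weights 8 and 8, blending against a black backdrop halves the layer's brightness, which is why both layers need a destination to blend with.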
What was working improperly about your attempts to flicker? I want to help you with this because adding a bit of flicker makes α-Lerp look even better.
#162719 - spinal_cord - Wed Sep 10, 2008 7:56 pm
I don't know much at all about interrupts. I tried using the existing vbl in the frodo source already, but it didn't seem to be ticking anywhere near fast enough, and all attempts to add my own failed. I also tried using a timer, but that wasn't in sync either, so it jittered now and then.
Should I PM you the source so you can have a look?
#162721 - Miked0801 - Wed Sep 10, 2008 9:38 pm
eKid, your code does not check if the value is non-zero before adding in the offset. The original code leaves values 1-255 alone but changes 0 to 16. Yours just changes things :)
Loads also take 3 cycles, not 2 as per gbatek.
Also, where's 4-bit specified? It's a good guess though looking at the 16-assignment value. Finally, eat a few more registers and use ldm/stms to further eat away your cycles per unit.
#162722 - Maxxie - Wed Sep 10, 2008 9:45 pm
Miked0801 wrote: |
Also, where's 4-bit specified? It's a good guess though looking at the 16-assignment value. |
Actually not. 16 translates to 0 if it is stored into a 4 bit unsigned integer, in which case you wouldn't change anything.
#162740 - Miked0801 - Thu Sep 11, 2008 6:21 pm
Was assuming a 4-bit color with the 4 bit as a flag - probably alpha. But you are correct, if straight 4-bit, 2 colors per byte, then that would do evil.
#162781 - spinal_cord - Fri Sep 12, 2008 8:21 pm
so can
Code: |
bufmem = (uint8*)malloc(512*512); // forgot to mention this bit before
int temp=0;
for(temp=0; temp<512*282; temp++)
{
if(bufmem[temp]==0)bufmem[temp]=16;
}
|
not be done super fast?
#162792 - keldon - Fri Sep 12, 2008 11:45 pm
Hmm; can you not just increment them all and change the palette instead?
#162793 - spinal_cord - Sat Sep 13, 2008 1:03 am
The palette is repeated 16 times, so all 256 entries are taken up by the 16-color palette. I'm not at all sure why this is, but I assume it's to speed up something or other. I tried moving all the entries up by 1, but then the last entry wraps round to 0, so sometimes light grey is transparent instead.
#162913 - Cearn - Tue Sep 16, 2008 9:22 am
Instead of adding, try ORing with 16 (or 0x10101010 for speed). This will make everything use the odd palette groups. ORing with 0xF0F0F0F0 would shift everything to the 0xF0-0xFF range, leaving the rest of the palette for other purposes.
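A small sketch of the OR approach, assuming the buffer is 4-byte aligned and the palette really is the same 16 colours repeated, so only the low nibble of each pixel selects the colour. The function name `or_mask_words` is hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* OR a mask into every 32-bit word (four 8-bit pixels at a time).
   With mask 0x10101010 no pixel can stay 0, and each low nibble is
   untouched, so every pixel keeps its colour in a repeated palette. */
static void or_mask_words(uint32_t *buf, size_t words, uint32_t mask)
{
    for (size_t i = 0; i < words; i++)
        buf[i] |= mask;
}
```

For the buffer in this thread the call would be something like `or_mask_words((uint32_t *)bufmem, 512 * 282 / 4, 0x10101010)`. There is no compare or branch per pixel, which is where the speed comes from.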
#162914 - spinal_cord - Tue Sep 16, 2008 9:32 am
Doesn't matter any more, i got flickering working.