#17572 - Lupin - Wed Mar 10, 2004 2:00 pm
I thought about saving the AND instructions by just letting my registers overflow, but i don't know if it would work because i don't know what happens on overflow :(
My fixed point value looks like this:
0000 00XX XXXX XXYY FFFF FFFF FFFF FFFF
where X = 8 bit coordinate for heightmap (256x256)
Y = extra 2 bit for texmap (1024x1024)
F = fraction
Now i thought of turning it into this:
XXXX XXXX YYFF FFFF FFFF FFFF FFFF FFFF
This would be no problem because i don't multiply my fixed point values... but i don't know if i would get a random number if the value overflows.
At the moment i use this:
Code: |
add r5,r5,r1
add r6,r6,r2
add r7,r7,r3
add r8,r8,r4
mov r11,#0x3FFFFFF
and r5,r5,r11
and r6,r6,r11
and r7,r7,r11
and r8,r8,r11
|
Saving the 4 ANDs would be great...
#17575 - Lupin - Wed Mar 10, 2004 3:01 pm
don't mind about this post... it works
#17585 - Miked0801 - Wed Mar 10, 2004 7:03 pm
As long as you don't set flags, you won't throw anything into carry, and your normal adds don't use it anyway, so you're fine. Overflow goes into the bit bucket, as you stated...
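A small C model of that wraparound, as a sketch only: it assumes the two layouts from the first post (value in the low 26 bits with a mask, versus value filling the whole word), and the function names are made up for illustration. Unsigned 32-bit addition in C wraps modulo 2^32, which matches what a plain add leaves in the register.

```c
#include <stdint.h>

/* Old layout: value lives in the low 26 bits and is masked after each add. */
static uint32_t step_masked(uint32_t pos, uint32_t delta)
{
    return (pos + delta) & 0x3FFFFFFu;
}

/* New layout: value fills all 32 bits. The sum wraps modulo 2^32, so the
   bits that would have been masked off simply fall out of the top. */
static uint32_t step_wrapped(uint32_t pos, uint32_t delta)
{
    return pos + delta;
}

/* With X in the top byte, the heightmap coordinate (0..255) is one shift. */
static uint32_t heightmap_coord(uint32_t pos)
{
    return pos >> 24;
}
```

Shifting an old-layout value left by 6 gives the new-layout value, and stepping either one produces the same coordinate, so the four ANDs can go.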
#17591 - Lupin - Wed Mar 10, 2004 8:40 pm
Is there a fast way to cut off the first 8 bits? I am doing it by shifting now.
If someone has some spare time, could you please take a look at my source code and help me a little? I really want to get this optimized as much as possible (reduce register usage and such)...
http://home.arcor.de/lupin003/voxel.txt
The interpolation part is a little bit long and i also have to push/pop 4 registers in my loop because the code renders in mode4 and therefore i need 4 registers for my rays over the voxel terrain (x/y positions and x/y velocities).
Please give me some hints if possible :)
#17602 - Miked0801 - Wed Mar 10, 2004 11:58 pm
Hi Lupin. In looking at this code for about 10 minutes, I don't have any direct changes for you as there's too much going on. In general though, I do have some "hints" that should help you.
You're using mov way too often as a loader/shifter. You need to figure out a way to get those shifts later on in the code.
In my opinion, you're trying to do too much at one time - you're out of registers and the way to fix this is to do a bit less at one time. Perhaps do only one pixel at a time, store the result in RAM, do the second, then combine. This will free you some registers and hopefully get rid of that nasty push/pop in the middle.
At nxtskypixel:, I'd do those strh's as a str or better yet an stm command. Even on a 16-bit bus, an stm command will beat a bunch of strh's in terms of speed.
Easy fix: Don't store r3 and r12 in the initial push/pop. Neither of these is expected to be preserved by the C compiler (both are scratch registers under the ARM calling convention).
Others?
#17630 - Lupin - Thu Mar 11, 2004 3:38 pm
great, thank you!
You are right with the push/pop, i already found a way to get rid of them (though it will require me to do 1 ldr and 2 mul instructions when i grab the color for the second pixel). The idea with the stm is great, it will also look way better... though i have yet to figure out how to use stm :)
#17632 - Lupin - Thu Mar 11, 2004 3:51 pm
i took a look at the stm instruction, but it seems like it only allows me to increment by 4 (32 bit) and not by 240, and i can also only write each register once :(
#17636 - Miked0801 - Thu Mar 11, 2004 6:13 pm
hehe. Didn't realize you were doing your strips vertical - is there any way to switch it to a horizontal algorithm? Seems like that would better play into the strengths of ARM.
#17639 - tepples - Thu Mar 11, 2004 7:16 pm
You can draw strips vertically by stretching and rotating the screen, the same method used for full-screen mode 5.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#17640 - DekuTree64 - Thu Mar 11, 2004 7:50 pm
Here's one of those nice mov-removing tricks to save a couple of cycles:
Code: |
;;;;;;load position
ldr r5,posX
ldr r6,posY
add r7,r5,r3,lsl #DSHIFT ;copy for second ray
add r8,r6,r4,lsl #DSHIFT
add r5,r5,r1,lsl #DSHIFT ;needed if we don't start from 0 distance (skip some steps)
add r6,r6,r2,lsl #DSHIFT |
And here:
Code: |
add r3, r1, #0x0100 ;advance Y by 1
mov r3, r3, lsl #16
mov r3, r3, lsr #16 |
r3 will never be negative after that add, right? That means the upper bits are all 0, except for maybe bit16 if the add overflows the first 16 bits, so you could do this:
Code: |
add r3, r1, #0x0100 ;advance Y by 1
bic r3, r3, #0xff0000 |
Or even use 0x10000, since the first bit is the only one you have to worry about.
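To check that claim, here is a minimal C model of both sequences. It assumes r1 holds a value in 0..0xFFFF before the add (the premise stated above); the function names are illustrative only.

```c
#include <stdint.h>

/* Original sequence: add, then lsl #16 / lsr #16 to keep the low 16 bits. */
static uint32_t advance_shift(uint32_t r1)
{
    uint32_t r3 = r1 + 0x0100u;
    r3 <<= 16;
    r3 >>= 16;
    return r3;
}

/* Suggested sequence: add, then bic with #0x10000 to clear bit 16 only. */
static uint32_t advance_bic(uint32_t r1)
{
    uint32_t r3 = r1 + 0x0100u;
    return r3 & ~0x10000u;
}

/* Count the inputs in 0..0xFFFF where the two sequences disagree. */
static int count_mismatches(void)
{
    int bad = 0;
    for (uint32_t r1 = 0; r1 <= 0xFFFFu; ++r1)
        if (advance_shift(r1) != advance_bic(r1))
            ++bad;
    return bad;
}
```

The largest possible sum is 0xFFFF + 0x100 = 0x100FF, so bit 16 is the only bit above the low halfword that can ever be set.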
As for freeing up registers, you should definitely do something about r10 sitting there as a counter through the entire loop. Either push it onto the stack, or sneak the counter into another register. Your deltas are always between -65536 and 65536, right? That means you can left shift them 14 places and asr back by 14 later to restore them. With that, you can store your counter in the lower 8 bits of one of them. You'll need to loop backward though: instead of going from 0 to 119 and breaking at 120, go from 119 to 0 and break at -1. So at the end of your loop, subtract 1 from the delta that holds the counter in its bottom bits, then check bit8. If at least the lower 9 bits were all 0, like 0xXXXXX000, and you subtract 1, you'll get 0xXXXXXFFF, so bit8 up to the start of your other value are all set to 1 exactly when your counter was 0 before the subtract. Careful though, when that happens the borrow also subtracts 1 from your delta, so be sure to add 1 back if you're going to use it again. It looks like you're loading new deltas after looping though, so it shouldn't be a problem.
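A rough C model of that counter-in-the-low-bits trick, under the assumptions stated above (delta in -65536..65536, shifted left by 14, counter in bits 0..7, bit 8 used as the "done" flag). The helper names are hypothetical.

```c
#include <stdint.h>

#define DELTA_SHIFT 14

/* Delta goes into bits 14..31, counter (0..255) into bits 0..7.
   Bits 8..13 start out as zero. */
static uint32_t pack(int32_t delta, uint32_t counter)
{
    return ((uint32_t)delta << DELTA_SHIFT) | (counter & 0xFFu);
}

/* Model of "asr #14": shift down, then sign-extend the 18-bit result. */
static int32_t unpack_delta(uint32_t packed)
{
    uint32_t v = packed >> DELTA_SHIFT;
    return (int32_t)(v ^ 0x20000u) - 0x20000;
}

/* Count down from iterations-1; returns how many times the body ran.
   *delta_after receives the delta as it reads once the loop has exited. */
static int run_loop(int32_t delta, uint32_t iterations, int32_t *delta_after)
{
    uint32_t reg = pack(delta, iterations - 1u);
    int ran = 0;
    for (;;) {
        ++ran;                 /* loop body would go here            */
        reg -= 1u;             /* sub  reg, reg, #1                  */
        if (reg & 0x100u)      /* tst  reg, #0x100 : borrow happened */
            break;
    }
    *delta_after = unpack_delta(reg);
    return ran;
}
```

Calling run_loop(1000, 120, &d) runs the body 120 times and leaves d at 999, which is the "delta loses 1 on the final borrow" caveat mentioned above.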
Or, another way to pack values is to sacrifice a little accuracy and store your deltas as 14-bit fraction plus one for integer plus one for sign, so they fit in 16 bits. Then you can store them like 0xXXXXYYYY where X is the first X delta and Y is the first Y delta, then use a second reg for the second X and Y deltas. If you don't mind a tiny bit of error added to your values, you can then just add that to your X pos, ASRed by 14, which leaves the lower 2 bits set to the Y delta's upper 2 bits, but does get the rest of the X delta's bits set back to 16-bit fixed point. Otherwise you'll need to mov first to get those lower 2 bits cleared.
You'll have mov and then add to get the Y deltas though, but hey, you saved 2 regs. You'll have to do something else with the counter though, because every bit of the delta regs is used.
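Here is a sketch of that 0xXXXXYYYY packing in C, assuming 16-bit deltas with 14 fraction bits and 16.16 positions as described above. The asr helper and the function names are illustrative, not taken from voxel.txt.

```c
#include <stdint.h>

/* Portable model of an ARM arithmetic shift right, for 0 < n < 32. */
static int32_t asr(uint32_t v, unsigned n)
{
    uint32_t r = v >> n;
    uint32_t sign = 1u << (31u - n);
    return (int32_t)(r ^ sign) - (int32_t)sign;
}

/* X delta in the top halfword, Y delta in the bottom halfword. */
static uint32_t pack_deltas(int16_t dx14, int16_t dy14)
{
    return ((uint32_t)(uint16_t)dx14 << 16) | (uint16_t)dy14;
}

/* add posX, posX, packed, asr #14
   Fast path: the low 2 bits are leftovers from the top of the Y delta. */
static int32_t x_delta_fast(uint32_t packed)
{
    return asr(packed, 14);
}

/* mov tmp, packed, asr #16 ; add posX, posX, tmp, lsl #2
   Exact path: the Y bits are dropped before scaling back to 16.16. */
static int32_t x_delta_exact(uint32_t packed)
{
    return asr(packed, 16) * 4;
}

/* mov tmp, packed, lsl #16 ; add posY, posY, tmp, asr #14 */
static int32_t y_delta(uint32_t packed)
{
    return asr(packed << 16, 14);
}
```

The fast X path is off by at most 3 units of 1/65536 per step, which is the "tiny bit of error" trade-off; the Y path is always exact because the shift left clears everything below it.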
If you're a hacker, you could store the counter to a PC-relative temporary location while you loop, and then, because you don't need the stack for anything, store the stack pointer (r13) to a PC-relative location as well, and then you have 4 free regs, which is what you were getting with the push/pop.
Careful though, if any of your interrupts switch to the user stack, they'll go boom. But if they always use the IRQ stack, it's perfectly fine to play with r13 in your code, as long as you restore it by the end.
EDIT: Oh, and another hackish way to get a bunch of regs is to use mrs/msr to switch to FIQ mode. FIQ mode has its own banked copies of r8-r14, so you get a fresh r8-r12 plus a separate r13 and r14. The ARM ARM says bad things will happen if you don't keep the processor status bits intact when you set the mode with an immediate value, so you'd need to have one normal reg with the value to switch to FIQ mode, and an FIQ reg with the value to switch back, but I've tested switching back and forth with immediate values and it worked fine, even on hardware.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku
#17641 - Lupin - Thu Mar 11, 2004 8:23 pm
wow, you have a lot of good ideas, though i won't be able to implement everything (sounds quite straightforward to me)
Well, can i just design my deltas AND my position values like this:
0000 0000 0000 YYYY YYYY YYXX XXXX XXXX
Where 0 are cleared bits and Y is my Y coordinate and X my X coordinate? This way i can store my X AND Y in 1 register, this makes 4 registers for both rays instead of 8 and i also save 2 add operations! I think i will try that... wow, you gave me a lot of good ideas!
#17644 - Lupin - Thu Mar 11, 2004 8:37 pm
lol, i just forgot that i would not have any precision at all since i would have no fraction :)
Well, i started rewriting everything, i was able to save a lot of registers, but the only bad thing now is that it doesn't work anymore :(
I hope i will finish a working version by tomorrow, but i am sure it's going to be very fast!
#17686 - Lupin - Fri Mar 12, 2004 5:17 pm
I updated my code, you can find it here:
http://home.arcor.de/lupin003/voxel.txt
Now i don't use the stack anymore. i didn't pack the deltas; instead i decided to pack both counters into one register.
Well, i am quite content with the current state of my code, but there is always room for more optimizations :P
You can download the current state of my rom here:
http://home.arcor.de/lupin003/voxel.gba
I think I am going to add some enemies next and make a simple game out of it :)
#17694 - Miked0801 - Fri Mar 12, 2004 7:07 pm
You can pre-load your posX and posY in the ===drawing part ===.
Doing so allows you to use mla to multiply and add at the same time, saving you a cycle and an instruction each time. I'd suggest the same earlier, but you are shifting at the same time, so no go there without a bit of register work.
Also, consider using count down timers for your loops - that way you get free compare against zero checks without having to call cmp all the time.
More as I get more cycles free myself :)
Mike
#17705 - Lupin - Sat Mar 13, 2004 12:03 am
Well, i would need to find a way to save the shift left here:
ldrb r8,[r11,r7]
mul r7, r9, r7
add r7, r7, r8, lsl #8
Well, i could change my data to be 16 bit and leave the lower 8 bits cleared, but that would be dumb :)
Thx for your hint with the mla instruction, i already wondered what's the name for it in arm asm (i tried mad, because that's what i know from vertex shaders :P)
#17706 - DekuTree64 - Sat Mar 13, 2004 12:26 am
According to the ARM manual, mla takes one more cycle than mul anyway, which is the same as a separate add takes, so there's really no point in using it aside from saving space. Even so, you might as well use it here:
Code: |
mul r8, r9, r3
ldr r11, posY
add r8, r8, r11
mul r9, r2, r9
ldr r11, posX
add r9, r9, r11 |
by doing this:
Code: |
ldr r11, posY
mla r8, r9, r3, r11
ldr r11, posX
mla r9, r2, r9, r11 |
And here:
Code: |
;Calculate effective screen address
; p += TIMES_120(NewHeight);
mov r11,r7,lsl #8 ;multiply NewHeight (Y screen coordinate)
sub r11,r11,r7,lsl #4 ;by 240
add r7,r14,r11 ;r7 = VBuffer(r14) + YValue(r11) |
you might as well do this:
Code: |
rsb r7, r7, r7, lsl #4
add r7, r14, r7, lsl #4
|
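For anyone who wants to convince themselves the two versions agree, here is a small C model of both address calculations. The function names are made up for illustration; h stands for NewHeight and vbuffer for the base address in r14.

```c
#include <stdint.h>

/* Original: (h << 8) - (h << 4) = 256h - 16h = 240h, then add the base. */
static uint32_t addr_original(uint32_t vbuffer, uint32_t h)
{
    uint32_t r11 = h << 8;
    r11 -= h << 4;
    return vbuffer + r11;
}

/* Suggested: rsb gives (h << 4) - h = 15h, and the shifted operand of the
   add supplies the remaining factor of 16. */
static uint32_t addr_rsb(uint32_t vbuffer, uint32_t h)
{
    uint32_t r7 = (h << 4) - h;
    return vbuffer + (r7 << 4);
}

/* Count screen rows 0..159 where the two versions disagree. */
static int count_row_mismatches(uint32_t vbuffer)
{
    int bad = 0;
    for (uint32_t h = 0; h < 160u; ++h)
        if (addr_original(vbuffer, h) != addr_rsb(vbuffer, h))
            ++bad;
    return bad;
}
```

Both come out to base + 240 * h; the rsb form just folds the second shift into the add, which saves one instruction.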
And as Mike said, loop backward. No reason to waste time comparing when you can use a subs to do it for you.
Aside from those minor optimizations, it looks pretty darn good. Nice to see it in action too, although you must get rid of that texture map. You can't see how pretty the full resolution is when the texture pixels are so big^_^
#17709 - poslundc - Sat Mar 13, 2004 12:43 am
DekuTree64 wrote: |
According to the ARM manual, mla takes one more cycle than mul anyway, which is the same as a separate add takes, so there's really no point in using it aside from saving space. |
You're also saving having to fetch-decode-execute an additional instruction, which in IWRAM means one less cycle (more if other memory areas).
Dan.
#17710 - Lupin - Sat Mar 13, 2004 12:58 am
Well, the mla also looks better :)
Deku, thanks for the hint with the rsb instruction (another instruction that i did not even know about :))!
OK, i'll upload the newest version with your optimizations now and then go to bed.
Deku, it should look a little bit better now. i noticed that i did something wrong with selecting the deltas, so now i have more accuracy in my trigonometry table. The texture map is 1024x1024; i could have used a larger map but that wouldn't be practical (look at the rom size :P).
http://home.arcor.de/lupin003/voxel.gba and voxel.txt
btw, did you try this on emu or gba? It looks way different on hardware... I can't believe that the color difference is just because of the gba screen, because my commercial roms appear (almost) the same color on PC as when i play the original cart on my gba...
Edit: i just noticed that the updated version doesn't work well on hardware... damn it :/
Edit: now it works, just had to change add to sub... it looks way better now (at least it appears so to me)