#16389 - corranga - Sat Feb 14, 2004 1:53 pm
I have searched the forums and read almost everything on division, but am at a bit of a loss. I've been trying to get division working (preferably fixed point 20:12). I need it to run FAST as I am attempting some 3D stuff on the gba.
I was trying to adapt the Division 32bit/32bit=32bit code from http://www.peter-teichmann.de/ahinte.html to work on the gba (I've not done any assembly at all really so I got a bit lost.
Using the official arm assembley reference and some other stuff I found
I presume ; (semicolon) is a comment, and for arm asm on the gba I need to change it to @ (at sign) and all the branchs have to have a : (colon) as does the function name ( ie |FDIV| becomes FDIV: )
Then I presumed the #&80000000 was hexidecimal for the highest bit (signed bit?) and changed it to #0x80000000
When I compile using GCC I get the following error:
"path"/Temp/ccUqQO9X.o: In function `FDIV':
"path"/Temp/ccUqQO9X.o(.text+0x2c): relocation truncated to fit: R_ARM_PC24 *UND*
collect2: ld returned 1 exit status
What am I doing wrong?
Chris
_________________
If virtual reality is ever on a par with reality, I want to be Bomberman! :D
#16392 - DekuTree64 - Sat Feb 14, 2004 4:47 pm
Hmm, well I tried converting it and it worked fine. Unfortunately it only does what the BIOS div does, so there's not much point in messing with it. The 64/32=32 is what you need, but it looks like it uses some kind of special assembler function to expand that bit of code that's there, but I don't know exactly what it does, so I can't try it.
Anyway, if you still want to get the other working, that error is usually due to trying to access something out of range (like ROM from IWRAM), or trying to branch to a missing label, which is likely the case. Check and make sure you got the : after all the non-commented label-looking lines.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku
#16399 - poslundc - Sat Feb 14, 2004 10:26 pm
It could quite possibly be a spelling error in one of your labels. That is the kind of thing that is likely to cause a linker error.
Dan.
#16797 - KevinW - Wed Feb 25, 2004 4:26 am
Don't be fooled into taking verbatim that a routine is "fast" because it's called, someone says, or it's simply labeled "fast".
For instance the "fast" divide in BIOS ROM might be faster then the standard GCC library routine but I found the library divide in the ARM ADS to be much faster!
Plus what you do is put these algorithms in IWRAM to get the best performance. If you setup your make in ADS to output special mapping/list files at the link stage, you can see what library(s) it's using for your math. You can then force these files to load into IWRAM (as I do) by using a scatter file.
You should be able to do this as well with GCC, but I personaly have not tried it.
The complier must automatically generate some veneer code so it can make calls/jumps to IWRAM from ROM. While this adds some extra cycles for the indirection, it?s a good thing after all since you?ll be able to call 32 routines from 16 bit ROM.
Incidentally, ADS only includes library functions in the binary that are actually used. So you don't have to wory so much about a libaray wasting space on the ADS.
Furthermore for my experience in this sort of thing is that you'd benefit greatly from breaking down your math and taking short cuts when feasible.
If for example you are making a division with a constant (in fixed point or not) you could multiply by a reciprocal instead.
Or use tables to a greater (eliminating an entire algorithm) or lesser (eliminating part of an algorithm) degree. For instance, you have 160 vertical scan lines so it might be possible to eliminate a few expensive operations by using a divide array table of 160 entries..
If you are lucky you can reorder some calculations into prime powers of two and use simple shifts (for both divides and multiplies) instead.
Also what you might want to do is cascade the last two timers into a utility 64bit cycle timer.
Then you can precisely time and compare various functions.
This alone will tell you what is ?fast? and what is not..
Please note that to add to the complexity to this is that some algorithms give better performance in some circumstances than others. Like if your denominator is small some division functions might be faster then say a function that?s better designed for general perpose use.
Note: you CAN use software floating point emulation libraries for setup and test code!
Yes they are prohibitedly slow for a working 3D graphics engine on the GBA, but works great for testing, profiling and proving things. Again you can use the same speedup by forcing the math emulation libraries in IWRAM (space permiting).
3D ?Software rendering? was a big subject a few years ago (back in the Doom, and Quake 1; days, etc). Look these up on the web for some ideas. Some of the tricks might apply to what you are doing.
Besure to check out the fast sqrt() routine I posted on the main www.gbadev.org page. This type of algorithm is an order of magnatude faster then the BIOS routine.
P.S. No I don?t work for ARM Corp., as you might thing from my so many posts on ARM ADS :-P
#16816 - Lupin - Wed Feb 25, 2004 3:30 pm
ADS is great... but there is almost no documentation out for it. I think i will try to make my projects able to compile with both compilers.
Maybe you should try to clearify what type of division is actually the fastest, because i am also not sure about what division would be the best...
The problem for me is that i am too lazy to set up a profiler for the gba :)
#16819 - Lupin - Wed Feb 25, 2004 3:37 pm
Kevin, i have a question about the ARM ADS debugger, how can i use swi commands with it? Everytime i call an SWI opcode it says that the opcode is unrecognized and stops debugging :(
#16821 - poslundc - Wed Feb 25, 2004 3:45 pm
Lupin wrote: |
Maybe you should try to clearify what type of division is actually the fastest, because i am also not sure about what division would be the best... |
The fastest division is by a power of two; then you can just shift. ;)
Seriously, though, the point is that it depends on your specific requirements. From fastest to slowest, the different techniques would be:
- Shifting (powers of 2 only)
- Multiplication by constant reciprocal (division by constant only)
- Reciprocal LUT (only works within the range of your LUT)
- Custom, optimized divide function
- BIOS divide function (SWI #6)
- gcc division
Note that while I place gcc as the slowest, gcc with any optimizations turned on will do the first and second technique if it can. (Not sure about without the optimizations.)
You can easily search the forum to find a way to improve on the BIOS function for your own optimized divide routine, but I find that if you're doing a division that you can't use any of the first three techniques for then you may as well just use the BIOS call. If it ever really becomes a problem than you can optimize it later. :P
Dan.
#16835 - Torlus - Wed Feb 25, 2004 6:39 pm
corranga wrote: |
When I compile using GCC I get the following error:
"path"/Temp/ccUqQO9X.o: In function `FDIV':
"path"/Temp/ccUqQO9X.o(.text+0x2c): relocation truncated to fit: R_ARM_PC24 *UND*
collect2: ld returned 1 exit status
What am I doing wrong?
Chris |
Hmm looks like some kind of "well-known" linker issue. Get Jeff Frohwein's crtls from his site (www.devrs.com) and follow his intructions (you can also search gbadev.org's forum for more information).[/b]
_________________
GBA,GC,NGPC,GP32,FPGA,DS stuff at http://torlus.com/
#16836 - Miked0801 - Wed Feb 25, 2004 6:50 pm
He also has a nice Arm Division routine on his site that - while 2x larger than the ARM one - it generally outperfoms it (it adds a binary search ot get the MSb of the larger term to cut down on division loop time. On larger numbers, it's faster, on smaller, it's a touch slower.)
#60757 - keldon - Mon Nov 14, 2005 2:19 pm
Code: |
LSB(a) {
// (if zero) |= n <<< means ORR c regN in Arm ASM
m = a&-a;
m & 0xaaaaaaaa
(if zero) lsb|= 1;
m & 0xcccccccc
(if zero) lsb|= 2;
m & 0xf0f0f0f0
(if zero) lsb|= 4
m & 0xff00ff00
(if zero) lsb|= 8
m & 0xffff0000
(if zero) lsb|= 16
return lsb
} |
Code: |
MSB(a){
b = a>>1;
b |= b>>2;
b |= b>>4;
b |= b>>8;
b |= b>>16;
m=a & !b;
return LSB(m);
} |
m = 1 << MSB(a) so the conditional OR's will convert (1 <<MSB) to MSB
there doesn't seem to be an LSB command for the arm, am i right?
#60861 - kusma - Tue Nov 15, 2005 2:53 pm
poslundc wrote: |
- Reciprocal LUT (only works within the range of your LUT)
|
Just a quick note: if you count the leading zero-bits, then you can quite easily normalize the number, and use a lut for the entire range. you don't really need a big lut to get acceptable results for stuff like 3d-projections either.
#60872 - poslundc - Tue Nov 15, 2005 6:55 pm
kusma wrote: |
poslundc wrote: |
- Reciprocal LUT (only works within the range of your LUT)
|
Just a quick note: if you count the leading zero-bits, then you can quite easily normalize the number, and use a lut for the entire range. you don't really need a big lut to get acceptable results for stuff like 3d-projections either. |
Interesting idea. So is it pretty much as follows:
- Count the leading zeros in your denominator
- Obtain the difference between that value and the range of your table (say it has 256 entries, you'd subtract 8)
- Downshift the denominator by that value, and look it up in your table to obtain the reciprocal
- Upshift the reciprocal (presumably into a long-long)
- Multiply
Presumably you'd have to throw in a test for negative denominators as well and negate your numerator if you encountered those values. And depending on the precision of your LUT's values and the size of your terms it might be necessary to save the upshift until after the multiply.
How does this work out in practice, and how does it stack up against other techniques, like interpolating between LUT values?
Dan.
#60876 - keldon - Tue Nov 15, 2005 7:44 pm
EDIT: Removed, see below
Last edited by keldon on Wed Nov 16, 2005 11:06 am; edited 1 time in total
#60967 - keldon - Wed Nov 16, 2005 10:40 am
i have optimised the MSB routine -- might have NZ(EQ) statements the wrong way around because i am used to using Z and NZ - it is supposed to AND if (AND R0) is NZ
also i don't know the right way to load the address of Data into R3, would you have to load the high 16, SHL 16, and then add the low 16? Also I am assuming that LDR R1, [R3+1] will give me Data+4, otherwise all of those instructions need to be altered;
Code: |
; r0 = num
; ret: R2 = MSB(r0)
;
; R1: loaded data
; R2: MSB
; R3: Data ptr
MSB:
XOR R2 R2
LDR R3 Data
LDR R1, [R3]
AND R1 R0 R1
AND EQ S R0 R0 R1
ADD EQ R2 16
LDR R1, [R3+1]
AND R1 R0 R1
AND EQ S R0 R0 R1
ADD EQ R2 8
LDR R1, [R3+2]
AND R1 R0 R1
AND EQ S R0 R0 R1
ADD EQ R2 4
LDR R1, [R3+3]
AND R1 R0 R1
AND S EQ R0 R0 R1
ADD EQ R2 2
LDR R1, [R3+4]
AND R1 R0 R1
AND EQ S R0 R0 R1
ADD EQ R2 1
ret
Data:
DW 0xffff0000, 0xff00ff00, 0xf0f0f0f0, 0xcccccccc, 0xaaaaaaaa |
#60980 - FluBBa - Wed Nov 16, 2005 1:12 pm
Wasn't this dissused here?
16 instructions without any memory accesses.
_________________
I probably suck, my not is a programmer.