#110532 - Joe_Sextus - Wed Nov 29, 2006 1:10 am
I have a set of functions that need to be optimized for speed and platform independence.
For Multiplication (by 10) I use
Code: |
x = (y << 3) + (y << 1) |
is there a similar way to do this for division (by10)?
Last edited by Joe_Sextus on Wed Nov 29, 2006 2:44 am; edited 1 time in total
#110541 - Dwedit - Wed Nov 29, 2006 2:28 am
short post:
Multiplying by ten is faster, don't bother with shifting and adding.
long post:
So we're trying to compare these two operations:
Code: |
add r0,r1,r1
add r0,r0,r1,lsl#3
mov r0,#10
mul r0,r1,r0
|
Both of these take 2S+1I cycles to complete.
In THUMB mode, the mov and multiply wins, since the shifting and adding version would need more instructions.
The ARM7 already has a multiply instruction, so you don't need to replace it with shifting and adding.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."
#110543 - Joe_Sextus - Wed Nov 29, 2006 2:54 am
I'm not as concerned about the multiplication, as I am with division.
What is the fastest way to do division and keep the platform independence? Is this the best way with gcc?
#110547 - DekuTree64 - Wed Nov 29, 2006 3:25 am
You can do reciprocal multiplication, like x*6554>>16. The 6554 there is (1<<16)/10 (which is 6553.6), rounded up. Be careful of overflowing 32 bits with the intermediate value though. Also, you'll start getting accuracy problems if x is too large. You could do a 64-bit multiply with (1<<32)/10 if necessary to deal with either of those.
I've heard before that GCC will automatically do reciprocal multiplies instead of real division when you divide by a constant, but I've never verified it personally.
And for the record, shift-add multiplies do have their place. For one thing, you can often squeeze in a free add with a non-shifted value, while mla takes an extra cycle to do it. More importantly though, you don't need a temporary register to store the value you're multiplying by. This can be a life saver in a tight inner loop if you'd have to swap things around on the stack otherwise.
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku
#110581 - kusma - Wed Nov 29, 2006 3:36 pm
Joe_Sextus wrote: |
I'm not as concerned about the multiplication, as I am with division.
What is the fastest way to do division and keep the platform independence? Is this the best way with gcc? |
gcc with optimizations turned on implement divisions by constants as long multiplies, so there's no real point in worrying about this.
#110591 - poslundc - Wed Nov 29, 2006 5:42 pm
kusma wrote: |
gcc with optimizations turned on implement divisions by constants as long multiplies, so there's no real point in worrying about this. |
In ARM mode it does the optimization... in Thumb mode, it does manual division :(
If you want to know how to combine shifts cumulatively to divide, this page goes into some detail (it's what I based my posprintf on). Suffice to say that it's far more mathematically complicated than multiplication, and you are usually far better off multiplying by a constant reciprocal when possible.
Dan.
#110658 - Joe_Sextus - Thu Nov 30, 2006 4:14 am
Thanks for the replies I think in this case the reciprocal approach may be best. But I discovered two case that I am not sure of how to handle.
Code: |
a = b / c
d = b % c |
Is this actually the best way to do this for platform independence and speed or is there a better way?
#110659 - Ant6n - Thu Nov 30, 2006 4:23 am
#110673 - gladius - Thu Nov 30, 2006 9:35 am
Modulus (the % operator) can be implemented as a multiply and divide as so:
d = b % c
d = b - (b / c) * c
So once you have your result a = b / c, you can easily compute modulus with an additional multiply and subtraction.
#110686 - Joe_Sextus - Thu Nov 30, 2006 2:07 pm
gladius wrote: |
Modulus (the % operator) can be implemented as a multiply and divide as so:
d = b % c
d = b - (b / c) * c
So once you have your result a = b / c, you can easily compute modulus with an additional multiply and subtraction. |
I feel dumb for not thinking of that.
#110739 - Miked0801 - Fri Dec 01, 2006 2:40 am
Many division routines actually have the modulus available to them upon termination - but for want of a place to return the value. Nothing like getting a divide and having the mod for free.
#110742 - Joe_Sextus - Fri Dec 01, 2006 2:51 am
Miked0801 wrote: |
Many division routines actually have the modulus available to them upon termination... |
I would have used the swi because of this, but I need the code to be independent of the GBA / DS.
#110743 - sgeos - Fri Dec 01, 2006 3:54 am
Joe_Sextus wrote: |
Miked0801 wrote: | Many division routines actually have the modulus available to them upon termination... |
I would have used the swi because of this, but I need the code to be independent of the GBA / DS. |
Use a wrapper function?
-Brendan
#110839 - Miked0801 - Sat Dec 02, 2006 2:58 am
The version of the divide I'm thinking of is a stand alone, ARM asm version. Basically, repeat 3 instructions per bit of the divide. When done you have the quotient and remainder both available for return. Jeff Frohein<sp> has a version of this divide on his site somewhere.
#110855 - Ant6n - Sat Dec 02, 2006 7:01 am
div10
; takes argument in a1
; returns quotient in a1, remainder in a2
; cycles could be saved if only divide or remainder is required
SUB a2, a1, #10 ; keep (x-10) for later
SUB a1, a1, a1, lsr #2
ADD a1, a1, a1, lsr #4
ADD a1, a1, a1, lsr #8
ADD a1, a1, a1, lsr #16
MOV a1, a1, lsr #3
ADD a3, a1, a1, asl #2
SUBS a2, a2, a3, asl #1 ; calc (x-10) - (x/10)*10
ADDPL a1, a1, #1 ; fix-up quotient
ADDMI a2, a2, #10 ; fix-up remainder
MOV pc, lr
#110871 - Cearn - Sat Dec 02, 2006 1:30 pm
Alternatively, compile the following as ARM code:
Code: |
#include <stdlib.h>
div_t div10(int x)
{
div_t y;
y.quot= x/10;
y.rem= x%10;
return y;
} |
In ARM the division and modulo will be turned into multiplications and subtractions, but it has the benefit of being portable and working for both positive and negative x). It does take a little longer than Ant6n's version because there's some stack-work involved.
#110881 - Joe_Sextus - Sat Dec 02, 2006 4:18 pm
I realized since my last post that both of my targets use versions of the ARM instruction set and that I will need division by a variable as well. So I googled for a division routine and found http://www.tofla.iconbar.com/tofla/arm/arm02/index.htm
My complete function (doesn't do signed division).
Code: |
.align
.arm
.global div
@void div(unsigned int retvalue, unsigned int dividend, unsigned int divisor, unsigned int modulus)
div:
cmp r2, #0 @if(divisor == 0)
beq .div_end @ return;
mov r0, #0 @retvalue = 0;
mov r3, #1 @placeholder = 1;
.start: @while(divisor < dividend)
cmp r2, r1 @{
movls r2, r2, lsl#1 @ divisor <<= 1;
movls r3, r3, lsl#1 @ placeholder <<= 1;
bls .start @}
.subtract @while(r1 - r2 > 0)
cmp r1, r2 @{
subcs r1, r1, r2 @ dividend -= divisor;
addcs r0, r0, r3 @ retvalue += placeholder;
movs r3, r3, lsr#1 @ placeholder >>= 1;
movs r2, r2, lsr#1 @ divisor >>= 1;
bcc .subtract @}
.div_end:
bx lr @return; |
Currently this code doesn't return the modulus, but it will when I finish it. (That's a simple mov r3, r1).
[EDIT]
Fixed typo.
#111084 - Miked0801 - Mon Dec 04, 2006 1:04 am
To add signed, just xor the starting 2 numbers and look at bit 31. If set, the result is negative, else is positive. Then ABS and go.
Another speed trick is to completely unroll the divide loop and jump into the code based on the number of signifigant bits. Man I wish I had the routine handy - I'd post it. ARM also has a routine they publish with their stuff that has a nice divider.
#111092 - keldon - Mon Dec 04, 2006 1:41 am
#111093 - Joe_Sextus - Mon Dec 04, 2006 1:45 am
I was thinking something similar to this for signed division
Code: |
void sdiv(signed int retvalue, signed int dividend, signed int divisor, signed int modulus)
{
bool negative = false;
if(dividend & 0x80000000)
{
negative = 1;
dividend *= -1;
}
if(divisor & 0x80000000)
{
negative = !negative;
divisor *= -1;
}
div(retvalue, dividend, divisor, modulus);
if(negative)
retvalue *= -1;
} |
Is there an easy way to change sign in ASM?
#111096 - keldon - Mon Dec 04, 2006 2:40 am
Joe_Sextus wrote: |
Is there an easy way to change sign in ASM? |
Code: |
int changesign(int a) {
return 0-a;
}
int changesign(int a){
return (~a)+1;
} |
#111101 - Joe_Sextus - Mon Dec 04, 2006 2:56 am
keldon wrote: |
Joe_Sextus wrote: | Is there an easy way to change sign in ASM? |
Code: | int changesign(int a) {
return 0-a;
}
int changesign(int a){
return (~a)+1;
} |
|
Are either of those more efficient than
Code: |
int changesign(int a)
{
return a * -1;
} |
#111103 - DekuTree64 - Mon Dec 04, 2006 3:12 am
Quickest in ARM:
Quickest in THUMB:
Multiplying is much slower and needs a temporary register:
Code: |
mov r1, #-1
mul r0, r1, r0 |
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku
#111195 - Miked0801 - Mon Dec 04, 2006 8:12 pm
The compiler is smart enough in every case I've come across to change:
x = x * -1
to
x = -x;
Which involves only 1 instruction.
And change sign even simpler is
int changeSign(int x)
{
return -x;
}
This does the 2s compliment for you.
#111201 - Joe_Sextus - Mon Dec 04, 2006 8:51 pm
Miked0801 wrote: |
The compiler is smart enough in every case I've come across to change:
x = x * -1
to
x = -x;
Which involves only 1 instruction.
And change sign even simpler is
int changeSign(int x)
{
return -x;
}
This does the 2s compliment for you. |
That will probably be the way I will do it.
If I want to passed return values into the function arguments, do I have to handle pointers in ASM, or does the C register keyword allow me to do that?
Code: |
void div(register unsigned int retvalue, register unsigned int dividend, register unsigned int divisor, register unsigned int modulus) |
#111210 - gladius - Mon Dec 04, 2006 10:14 pm
The arm calling convention by default puts the first 4 arguments that are 32 bits or less into r0-r3, so that scenario will "just work".
Also, in newer compilers the register keyword is completely ignored.
#111238 - Miked0801 - Tue Dec 05, 2006 3:43 am
#111241 - Ant6n - Tue Dec 05, 2006 4:17 am
in this regard it probably would be good if somebody mentions static inline functions, i dont remember how they work for sure though. I think its possible to just define a function like that in a header file
static inline int changeSign(int x)
{
return -x;
}
are there any compiler flags needed now??
#111248 - Joe_Sextus - Tue Dec 05, 2006 5:29 am
I have already looked at that. When I first looked at it I didn't understand it, until I realized that it is the same basic thing as mine only with unrolled loops.
Ant6n wrote: |
in this regard it probably would be good if somebody mentions static inline functions, i dont remember how they work for sure though. I think its possible to just define a function like that in a header file
static inline int changeSign(int x)
{
return -x;
}
are there any compiler flags needed now?? |
Why not just do this then
Code: |
#define changeSign(x) (-x) |
#111252 - poslundc - Tue Dec 05, 2006 8:12 am
Joe_Sextus wrote: |
Why not just do this then
Code: | #define changeSign(x) (-x) |
|
I could riff on the advantages of type-checking and debugging and whatnot, but instead I'll just pounce on the example you gave to illustrate the dangers of reckless macro use.
What do you think will be the result if you use your changeSign macro with the following?
Code: |
int a = 4;
int b = 5;
int c = changeSign(a + b); |
Dan.
#111293 - gmiller - Tue Dec 05, 2006 3:01 pm
Change your macro to:
Code: |
#define changeSign(x) (-(x))
|
To keep from messing up like poslundc is pointing out.
#111352 - Miked0801 - Tue Dec 05, 2006 7:47 pm
Quote: |
I have already looked at that. When I first looked at it I didn't understand it, until I realized that it is the same basic thing as mine only with unrolled loops.
|
It takes it a step beyond actually - it does a binary search based on the MSB to prevent a whole bunch of wasted time. It's faster on smaller divides, but a touch slower on larger number divides. I found it to be very clever.