gbadev.org forum archive

I am working on something where I need to keep only the first 2 digits of an unsigned decimal number. Here is what I mean:

99345 would become 99 (99345 / 1000)
2389454 would become 23 (2389454 / 100000)
1024 would become 10 (1024 / 100)

See what I mean?

Not really any multiple of ten but but you see the point...

Is there an algo (fast algo if possible) to do this? Right now, I use the reciprocal multiplication (converted to shifts) method to divide by multiples of ten to get the answer. But if there is a better method, I'd like to know. Thanks.

Code:

int firstDigits ( int i ) {
int j;
for ( j = 1; j <= i / 100; j = j * 10 ) {}
return i / j;
}

Statistics:
divides: 2
multiplies: log10 ( i / 10 )

My posprintf routine converts from base-2 to base-10, so you could then take the first two characters of the output string and subtract '0' to get the actual digits. It takes a fairly significant speed hit at numbers larger than 65,535, though, which I notice was the case with your second example. Not so much that it gets unmanageably slow (it entails calling one BIOS divide), but enough that someone else's routine on this board might be faster.

Dan.

@keldon:

I need something fast with no divides nor multiplies :)

@poslundc

My method is still faster requiring just a few shifts and adds for multiplying by the reciprocal of the divider. It's just that I thought it was a bit ugly because I have to test the number to process to see in what decimal range it is to determine by what I have to divide it.

isildur wrote:

My method is still faster requiring just a few shifts and adds for multiplying by the reciprocal of the divider. It's just that I thought it was a bit ugly because I have to test the number to process to see in what decimal range it is to determine by what I have to divide it.

You're kind of right; I should've looked closer at what exactly you were doing. At the same time, I suspect your way might not be faster for precisely the reason you've specified: you still have to test to determine what range it's in.

If the numbers are uniformly distributed, you could speed the tests up somewhat by exploiting that distribution. Say your numbers range from 0 to 99,999: start your testing at the point of greatest density in the distribution (x > 9,999) then work your elses down the line (x > 999, x > 99, x > 9).

Dan.

Or you could do a binary range search aswell, which is also quicker than a linear search.

[quote="poslundc"]

isildur wrote:

If the numbers are uniformly distributed, you could speed the tests up somewhat by exploiting that distribution. Say your numbers range from 0 to 99,999: start your testing at the point of greatest density in the distribution (x > 9,999) then work your elses down the line (x > 999, x > 99, x > 9).

Your reason for this may depend on what you need. If you want the 'best-possible worst-case' then you would (also) check for x>9,999 first as then only one comparison is required for the conversion that requires the most divides, and the conversion requiring many comparisons will require the least number of divides.

I'm not sure if this is interesting or not, and it probably isn't likely to be relevent to your problem. or perhaps it's just very obvious. :/

"unsigned decimal number"
What's the range of this number?

And what do you need this function for?

Here's my approach - I tested it with a variety of inputs and it seems okay. I've written this in C for ease reading, and debugging, but I think it would be trivial to convert it to assembler if you felt so inclined.

By my count, this is 9 compares, and a maximum of 6 adds, and six shifted subtracts.

Code:

/*************************************
function: first2
purpose: to return the first two digits (decimal representation) of an unsigned int
IMPORTANT LIMITATION: only works with values up to 999,999,999

Author: Neil McConnell
Date: Nov 3, 2004
*************************************/

#define tenPower0 1
#define tenPower1 10
#define tenPower2 100
#define tenPower3 1000
#define tenPower4 10000
#define tenPower5 100000
#define tenPower6 1000000
#define tenPower7 10000000
#define tenPower8 100000000

int first2(u32 x){
int retval=0;
u32 base;

if(x>=(tenPower5)){
if(x>=(tenPower7)){
if(x>=(tenPower8)){
base=tenPower7;
}
else{
base=tenPower6;
}
}
else{ //!(x>=(tenPower7)
if(x>=(tenPower6)){
base=tenPower5;
}
else{
base=tenPower4;
}
}
}
else{ //!(x>=(tenPower5)
if(x>=(tenPower3)){
if(x>=(tenPower4)){
base=tenPower3;
}
else{
base=tenPower2;
}
}
else{//(x>tenPower3)
if(x>=(tenPower2)){
base=tenPower1;
}
else{
base=tenPower0;
}
}
}

//the base is the smallest integer power of 10 such that (base * 100 > x)

if(x>= (base<<6)) {x-=base<<6; retval+=1<<6;}
if(x>= (base<<5)) {x-=base<<5; retval+=1<<5;}
if(x>= (base<<4)) {x-=base<<4; retval+=1<<4;}
if(x>= (base<<3)) {x-=base<<3; retval+=1<<3;}
if(x>= (base<<2)) {x-=base<<2; retval+=1<<2;}
if(x>= (base<<1)) {x-=base<<1; retval+=1<<1;}
if(x>= (base<<0)) {x-=base<<0; retval+=1<<0;}

return retval;

}

anyway, let me know if that works out for you

@ampz:

I am working on a fast aproximate square root function. I use a LUT with only the first 100 square roots and I aproximate the square roots for numbers higher than 100. So I need to keep the first 2 digits of a value because it becomes an index in my LUT. So basically, I need to know the decimal range of any 32 bit unsigned value passed.

Ex: if I need the square root of 349877, then I keep the first 2 digits (34) and get the value in the LUT at index 34. Then using this value, I aproximate the square root of 349877.

I get 3 times faster results than the asm square root code from Dooby which is available on my web site in the math section. I still have to convert it to arm assembly, so I guess it can still be a little faster.

Of course, I lose more precision with large numbers but I still get acceptable results (for my needs) with numbers under a million.

I use the square root function for 3d collision detection and vector length calculation. And since I need the fastest possible code for a square root, I realized that I could suffer a bit of precision loss without any impact.

@nmcconnell:

Thanks for posting that, I will test it later today. I will use your technique if it proves to be faster for my needs. I will tell you after I test it.

isildur wrote:

@ampz:

I am working on a fast aproximate square root function. I use a LUT with only the first 100 square roots and I aproximate the square roots for numbers higher than 100. So I need to keep the first 2 digits of a value because it becomes an index in my LUT. So basically, I need to know the decimal range of any 32 bit unsigned value passed.

I'm not clear on the mathematics behind your approximation, but would it not be possible to make a LUT for the first 128 square roots, and then instead of using the first two digits use the top 7 bits of the number? Break your number up into four blocks of 8 bits to determine which chunk to use, eg:

Code:

u32 n;
if (!(n = x >> 24))
if (!(n = ((x >> 16) & 0xFF)))
if(!(n = ((x >> 8) & 0xFF)))
n = x & 0xFF;

This way you don't need to worry about the decimal conversion at all.

Dan.

Hello,

I definitely agree with Dan on using a table of 128 instead of 100. If you're using a lookup table, then 128 will definitely be much simpler to accomplish.

There's a lot of very bright square root code located here:
http://www.azillionmonkeys.com/qed/sqroot.html
Search for "Update! Mark Cowne pointed out that unrolling Jim Ulery's solution yields a substantial increase in performance" and the code right after that should probably do for you. It's going to give you an integer square root very quickly.

If you are intent on using a lookup (which should be faster but less acurate than the prviously mentioned code) then I'd advocate doing a binary search to find magnitude, then do your lookup from there. I think that you will get better results from this than what you would get from breaking the number into arbitrary chunks.

PS - I'm still pretty pleased with my code for finding the first two decimal digits. However, while I suspect it's a fairly snappy solution to the originally stated problem, I don't think it will be remotely useful to your final square root code.

@poslundc:

I don't think I can do it that way. Here is my algo.

I noticed that square roots can be aproximated by multiplying a corresponding square root by a constant or, by multiplying by a corresponding square root by the decimal range of the value divided by 100, depending on the decimal range of the value.

Whew! I'm sure it's not very clear so here is an example.

Code:

PSEUDOCODE (not optimized, just the algo)

magicNumber = 3.162277
LUT = {square roots of numbers 0-100}
value = number for which we want the square root

if(value <= 100) return LUT[value]

if(value < 1000) return LUT[value / 10] * magicNumber

if(value < 10000) return LUT[value / 100] * 10

if(value < 100000) return LUT[value / 1000] * magicNumber * 10

if(value < 1000000) return LUT[value / 10000] * 100

if(value < 10000000) return LUT[value / 100000] * magicNumber * 100

if(value < 100000000) return LUT[value / 1000000] * 1000

if(value < 1000000000) return LUT[value / 10000000] * magicNumber * 1000

See what I mean?

This is just the base of the algo, I add a little precision by not using these constants exactly like that to average the precision loss within a given range. And of course, I convert the muls and divs to series of shifts and adds or subs. So it is quite fast. I will post the assembly version of it when it will be ready and fully tested.

I don't know if this square root aproximation method (or a similar) already exists but it looks like it works well for what I need it for.

@nmcconnell:

Well thanks for the link. I found one very good and very fast square root function there. It is a little faster than mine, but what I like about it is that it's a lot more precise than mine. So I guess I will not use mine now :)

I tried doing some random testing but I kept screwing up when inputting your formula on my calculator... :P

About what margin of error do you get with your formula?

Dan.

if you use my formula using floating point numbers, there isn't any significant loss of precision. In unsigned int form, the loss of precision ranges from 1 (when in 1000-9999 range) to 200 (when in 1000000-9999999 range). Loss increases when the decimal range increases, but I get perfect precision in the middle of each range. Worst precision is found at the start and end of the ranges.

Anyways, the implementation found on the link in nmcconnell's post above is far better. I had fun trying though :)

You don't necesarily need to give up yet. You could pretty certainly top the calculation code by using a lookup if you still want to try to squeeze out a little more speed.

Here's the real timesaver though:

For many 3D calculations, you don't need to use the square root function at all.

For example, I notice that you are using square root for colision detection. I'm assuming that you mean that you're checking the distance between two objects and then if the distance is small enough, you have a colision. This is to say that you're checking to see if:
sqrt(x*x+y*y+z*z) < collisionCutOff

It is generally prefereable to check:
x*x+y*y+z*z < collisionCutOff*collisionCutOff

This saves you the square root altogether. If you're getting overflow, you can shift both sides of the comparisson to the right.

------------------------------------------------

Anyway, with regard to your algorithm that uses all the 10s, base ten is very intuitive for humans because it is the number system we use most often. However, any algorithm that is explicitly tied to base 10 is really only useful for humans - computers generally prefer base 2 for exactly the same reasons humans prefer base 10 - because you can replace a multiply or a divide with a shift operation.

For example, if you asked me to aproximate the square root of 5,000,000 in my head, I'd start by taking out the million, and I'll say that that's six zeros, so the square root should get three zeros. Then I'd say that the sqaure root of 5 is 2 and spare change. This would give me an aproximate square root of 2,000 although I'd know that this is low. A good computer algorithm that uses a lookup will work almost exactly the same way, but in base 2 (binary) instead of base 10.

I hope this helps you out. Seriously, though, the most important thing you should remember is that there are many cases when you don't need to use square root at all. This is very important!

--Neil

In addition to this, if what you want is the distance between two cartesian points (and it's not sufficient just to check the squared versions) there's an excellent approximation function:

Code:

// distApprox adds 8 to the precision of the result.
// Therefore two integers will generate a 24.8 result.
// Expected error around 7%.

#define DA_FOS(n) (((n) << 6) + ((n) << 4))

inline s32 distApprox(s32 x, s32 y)
{
if (x < y)
return ((x << 8) + DA_FOS(y));
else
return ((y << 8) + DA_FOS(x));
}

Dan.

@nmcconnell:

Thanks for your tip on not using the square root. I will use it. :)

And I will try to use a 256 LUT instead for my algo, just for the fun of it and for the sake of completing what I had started. I always like experimenting with algos.

Also, I will convert the implementation from your link in arm assembly and post it on my site since it beats the square root algo from Dooby that I had posted.

Good call Dan. I think I remember that one from graphics gems or something.

Has anyone ever extended that approximation into 3-space? I'm just thinking about it now. . . what if you were do write a 3d dist function as

Code:

inline s32 distApprox3(s32 x, s32 y, s32z){
return(distApprox(distApprox(x, y), z<<8));
}

This would create a cumulative error of more than the 7% from the 2D algorithm. It should still be within 15% though, I think.

Anyone with more experience in this domain wanna comment? Could the error be reduced by processing the two smaller distances first?

--Neil

I just tried Dan's suggestion with the distance aproximation but it doesn't have enough precision for what I need. So, I now use nmcconnell's trick for collision detection. Still need 3 muls but I had 2 muls before plus the sqrt().

If you want to check the distance between two points on the GBA screen, then there is a 2-3 instruction way to do it:

lookup[y][x]

Or more optimal:

lookup[(y<<8) | x]

The above table takes 80kByte of flash.
Here is a version which trades one extra instruction in order to cut table size in half:

lookup[(y<<8) | x]+x

I am doing 3d object collision detection, though I need only to check in X and Z because the objects are sitting on the ground.

ahh. detecting in 2 dimensions explains your comment:

Quote:

I just tried Dan's suggestion with the distance aproximation but it doesn't have enough precision for what I need. So, I now use nmcconnell's trick for collision detection. Still need 3 muls but I had 2 muls before plus the sqrt().

However, don't forget that if your collision threshold distance is the same every time, then you only need to do that multiplication once and store it. This means that you're almost back down to just two multiplications per collision check.

There are some other things you can do that can save you time in a collision detection system. It sort of depends on how you've got things set up, but in general, you should not be looking at a scenario where you just iterate through each object and check its distance from every other object. This is an O(n^2) algorithm, and can usually (always?) be improved upon.

I think there must be extensive posting on the topic of optimal collision detection on this board and others. I haven't actually checked, but it's worth looking in to.

nmcconnell wrote:

ahh. detecting in 2 dimensions explains your comment:

However, don't forget that if your collision threshold distance is the same every time, then you only need to do that multiplication once and store it. This means that you're almost back down to just two multiplications per collision check.

Yes, I will store that value in the object's structure.

nmcconnell wrote:

There are some other things you can do that can save you time in a collision detection system. It sort of depends on how you've got things set up, but in general, you should not be looking at a scenario where you just iterate through each object and check its distance from every other object. This is an O(n^2) algorithm, and can usually (always?) be improved upon.

In my game, most of the objects are static, I only need to check on 4 moving objects.

Then why the big fuzz about speed?

Quote:

Then why the big fuzz about speed?

Well, why not?

I am doing 3d in mode 4. I have only 4 moving objects but I have to check if they collide with all the static objects.

I am aiming for 60 fps and at least 30 fps when there's lots of stuff to render. So, I have to optimize every bit of code I can to keep this frame rate. This alone might not make a huge difference but add up all the unoptimized math code and then it make a difference. And I just hate jerky 3d.

And anyways, I like optimizing routines like that. I like having the fastest code for 3d and math stuff.
[/quote]

Thought I'd chime in here, since square-root is a problem I've tackled before. In the event that you really do need to compute square roots, such as you might require for normalizing vectors, there are a variety of efficient mathematical techniques that exist, ranging from the silly (such as odd-number sums) to the very sophisticated (bit-wise logarithmic tricks). Paul Hseih has tracked these algorithms at his square-root web site: http://www.azillionmonkeys.com/qed/sqroot.html

For what it's worth, I've been using Mark Cowne's unrolled version of Jim Ulery's fast square-root algorithm for fixed-point vector-normalization, and it's been quite serviceable for that. No doubt it would work even better with a tightly-coded assembly version, but I haven't actually needed that much speed.

Of course, square root is a relatively easy problem... now performing pow(x,y) in fixed-point --- efficiently --- now that's tough (see Knuth 1.2.2.25 for hints...).
_________________
Do you suppose if I put a signature here, anyone would read it? No? I didn't think so either.

The link you posted was already posted above, and that's the square root code I'm using now. Though, I don't need the square root anymore for collision detection.

isildur wrote:

The link you posted was already posted above, and that's the square root code I'm using now. Though, I don't need the square root anymore for collision detection.

Ah, indeed. I missed that the first time reading through this thread. Ah, well, better too much information than too little. And you may still want to retain that link: If you're doing heavy 3-D stuff, you're going to eventually want a routine for vector normalization, and that means square root.
_________________
Do you suppose if I put a signature here, anyone would read it? No? I didn't think so either.

Well you can multiply by 10 by doing things like (x<<1)+(x<<3) so im sure there is some way to divide, but I can't see how immediately. I will have a think about it at work today.
_________________
www.hungrydeveloper.com
Version 2.0 now up - guaranteed at least 100% more pleasing!

To divide by 10 you could do this:

Code:

Since we know that:
x / 10 = 0.1 * 10

So you precompute 0.1 * 1024 = 102.4, but you round it to 103.

Then it makes (x * 103) >> 10 = x / 10

So you can finally do this:

Turn (x*103) in shifts: ((x << 6) + (x << 5) + (x << 2) + (x << 1) + x) >> 10 = x / 10

Dear god, just do a multiply and be done with it! It's only 2-5 cycles :)

Oh really? :-0

LOL

I thought those multiplies took more cycles than that. Oh well...

If you're coding in assembler, the size of the third parameter dictates how long it takes. If there are 1-8 significant bits (positive or negative) then it's only two cycles, and as the number of significant bits increase (9-16, 17-24, 25-32) the cycle count increases to five.

In assembly you can often do some mega-optimizations if you know the domain restrictions on your inputs. I don't think GCC is smart enough to do those kinds of optimizations, though, even when obvious.

The fast multiply opcode does eliminate the need to do most complicated shift-multiplies. Simple ones may still be worthwhile, since you still need to load the multiplicand into a register and this is non-trivial if it can't be expressed as a shifted 8-bit constant.

Dan.

isildur wrote:

I am working on something where I need to keep only the first 2 digits of an unsigned decimal number. Here is what I mean:

99345 would become 99 (99345 / 1000)
2389454 would become 23 (2389454 / 100000)
1024 would become 10 (1024 / 100)

See what I mean?

Not really any multiple of ten but but you see the point...

Is there an algo (fast algo if possible) to do this? Right now, I use the reciprocal multiplication (converted to shifts) method to divide by multiples of ten to get the answer. But if there is a better method, I'd like to know. Thanks.

'A simple solution'
First 'normalise' your number so multiply by 10 until it is greater or equal to 100000000UL (beware of the special case 0)
(Which is the largest multiple of 10 that will fit into 32 bits)

U32 d0=0,d1=0;

U32 n=666; // your number to extrat the digits from (unsigned 32 bits)

while (n<100000000UL && n!=0) n*=10; // 'normalise' your number

Then your first digit is :-

if (n>=500000000UL) {d0+=5;n-=500000000UL;}
if (n>=200000000UL) {d0+=2;n-=200000000UL;}
if (n>=100000000UL) {d0+=1;n-=100000000UL;}
if (n>=100000000UL) {d0+=1;n-=100000000UL;}

Now extract your next digit

n*=10; // next digit

if (n>=500000000UL) {d1+=5;n-=500000000UL;}
if (n>=200000000UL) {d1+=2;n-=200000000UL;}
if (n>=100000000UL) {d1+=1;n-=100000000UL;}
if (n>=100000000UL) {d1+=1;n-=100000000UL;}

d0 & d1 are your 2 leading digits 0..9.

Not the fastest way of doing it/doesn't involve tables/fixup etc.

(N.B. I've not tested this, so be it on your own head!)

AnthC wrote:

while (n<100000000UL && n!=0) n*=10; // 'normalise' your number

That's just something I wanted to avoid :-)

isildur wrote:

AnthC wrote:

while (n<100000000UL && n!=0) n*=10; // 'normalise' your number

That's just something I wanted to avoid :-)

Well there's ways around that, you could split it into a binary search, although I don't see the while loop taking up _that_ much CPU.
(Mul's tend to be cheap on the GBA's CPU)

gbadev.org forum archive

Coding > Division by multiples of 10...

#28558 - isildur - Tue Nov 02, 2004 9:15 pm

#28561 - keldon - Tue Nov 02, 2004 9:29 pm

#28564 - poslundc - Tue Nov 02, 2004 9:43 pm

#28572 - isildur - Tue Nov 02, 2004 10:41 pm

#28574 - poslundc - Tue Nov 02, 2004 10:55 pm

#28575 - keldon - Tue Nov 02, 2004 11:03 pm

#28579 - pyros - Tue Nov 02, 2004 11:31 pm

#28649 - ampz - Wed Nov 03, 2004 11:39 pm

#28653 - nmcconnell - Thu Nov 04, 2004 1:58 am

#28692 - isildur - Thu Nov 04, 2004 4:43 pm

#28693 - poslundc - Thu Nov 04, 2004 5:08 pm

#28694 - nmcconnell - Thu Nov 04, 2004 6:08 pm

#28697 - isildur - Thu Nov 04, 2004 6:33 pm

#28699 - isildur - Thu Nov 04, 2004 6:54 pm

#28701 - poslundc - Thu Nov 04, 2004 7:24 pm

#28702 - isildur - Thu Nov 04, 2004 8:13 pm

#28703 - nmcconnell - Thu Nov 04, 2004 8:35 pm

#28704 - poslundc - Thu Nov 04, 2004 8:48 pm

#28705 - isildur - Thu Nov 04, 2004 9:00 pm

#28707 - nmcconnell - Thu Nov 04, 2004 9:09 pm

#28708 - isildur - Thu Nov 04, 2004 9:30 pm

#28711 - ampz - Thu Nov 04, 2004 10:23 pm

#28713 - isildur - Thu Nov 04, 2004 10:45 pm

#28715 - nmcconnell - Thu Nov 04, 2004 10:56 pm

#28724 - isildur - Fri Nov 05, 2004 2:44 am

#28742 - ampz - Fri Nov 05, 2004 7:18 am

#28752 - isildur - Fri Nov 05, 2004 3:47 pm

#28807 - phantom-inker - Sat Nov 06, 2004 8:13 pm

#28820 - isildur - Sun Nov 07, 2004 1:40 am

#29050 - phantom-inker - Wed Nov 10, 2004 8:18 pm

#29061 - MumblyJoe - Wed Nov 10, 2004 10:40 pm

#29066 - isildur - Wed Nov 10, 2004 10:57 pm

#29111 - Miked0801 - Thu Nov 11, 2004 7:07 pm

#29112 - isildur - Thu Nov 11, 2004 7:18 pm

#29114 - poslundc - Thu Nov 11, 2004 7:32 pm

#29360 - AnthC - Fri Nov 19, 2004 1:04 am

#29361 - isildur - Fri Nov 19, 2004 1:40 am

#29946 - AnthC - Sat Nov 27, 2004 1:33 am