gbadev.org forum archive

Okay, last time I wanted help, it was all about fixed point math and divide. People helped me to see how to construct fixed point fractions.

Now I have run into another problem. I need to construct bitshifts to replace multiplication.

Here are two examples...

Using slow multiplication:

Code:

1 * 12 = 12
4 * 12 = 48

And using bitshifts and additions instead:

Code:

((1 << 2) + ((1 << 2) << 1)) = 12
((4 << 2) + ((4 << 2) << 1)) = 48

Using slow multiplication:

Code:

1 * 1152 = 1152
4 * 1152 = 4608

And using bitshifts and additions instead:

Code:

((1 << 7) + ((1 << 7) << 3)) = 1152
((4 << 7) + ((4 << 7) << 3)) = 4608

Now, is there any technique I should use to come up with these bitwise/addition formulas?

I want to be able to do this in bitwise/addition formula:

Code:

1 * 14 = 14
4 * 14 = 56

How do I do? How must I think?
_________________
Exceptions are fun

I would assume the slow way is something like

Code:

int multiply(int operand1, int operand2)
{
int i,j=0;
for(i=0;i<operand1;i++)
j+=operand2;
return j;
}

As such, The faster multiplication, using shift + adds, follows something like this.

Code:

int multiply(int operand1, int operand2)
{
int i=operand1, j=operand2,k=0;

while(i>0)
{
if(i&1)
{
k+=j;
j<<=1;
i>>=1;
}
}

return k;

return j;
}

No no! No while loops!

1 and 4 is the changing factor. The variable that is. The other factor is replaced with bitshifts and addition of the changing factor. You have two changing factors which I am not looking for.

I can try myself by testing...

Code:

// 1 * 14 = 14

a = 1;
a = a << 2; // 4
b = a << 1; // 8
c = a >> 1; // 2
result = a + b + c; // 4 + 8 + 2 = 14

But it took me time to come up with this! How can I come up with this easier? Is there a 14 / 2... 14 / 4... 14 / 8 test method of any kind?
_________________
Exceptions are fun

If this is targetting the gba or ds, multiplies are pretty darn quick. Especially when the high 24 or 16 bits of one of the terms are the same (which they would be in your examples).

High 24 bits same = 1 cycle
High 16 bits same = 2 cycles
High 8 bits same = 3 cycles
Default = 4 cycles

mla always adds one cycle.

If you really feel the need to bust out the shifts, the easiest way to calculate the shifts in general is probably to write a very simple search program to do it for you. Use some heuristics, while there may be a greedy method that works I can't prove an easy one to myself right now :). In general it sounds like a class of problems known as dynamic programming, where you can solve a larger problem with stored information about a bunch of equivalent smaller versions of the problem. There might be some smart way to solve it using the prime factorization of the multiplier as well.

gladius wrote:

High 24 bits same = 1 cycle
High 16 bits same = 2 cycles
High 8 bits same = 3 cycles
Default = 4 cycles

Sorry to be pedantic about this, but this is true for IWRAM, not in general. The proper timing is 1S+mI for MUL and 1S+(m+1)I for MLA. 1S is a sequential access and I an internal access. I always takes 1 cycle, but S is subject to waitstates and takes 1 cycle in IWRAM, 3 in EWRAM or ROM (ignoring waitstate settings for the moment). Thumb shifts/adds/subs are 1S as well, so in the slower sections a simple mul will often be faster. ARM/IWRAM changes things a little because the S/I ratio is different and you can combine add/sub with a shift, but it's still true that you shouldn't could muls out so fast.

The m is the number of significant bytes of the second argument in the assembly instruction. I don't know if GCC automatically transfers the one with the least amount of bits to the second arg.

GBATek, multiply. PBU Martin.

PS: 14 = 16-2, or (8-1)<<1. The key is the number of bit clusters.

Cearn wrote:

14 = 16-2, or (8-1)<<1. The key is the number of bit clusters.

I have a factor to multiply with. And the factor changes. Therefore there is only one way to move the bits in order to get the same result for any factor.

If you need to know what i am using this for, I am using this method for a jump table, and for data access, and for scroll offsets. And I have run out of cycles this time. I have no choice or else my Vblank will not work. So please help me!
_________________
Exceptions are fun

Cearn wrote:

Sorry to be pedantic about this, but this is true for IWRAM, not in general. The proper timing is 1S+mI for MUL and 1S+(m+1)I for MLA. 1S is a sequential access and I an internal access. I always takes 1 cycle, but S is subject to waitstates and takes 1 cycle in IWRAM, 3 in EWRAM or ROM (ignoring waitstate settings for the moment). Thumb shifts/adds/subs are 1S as well, so in the slower sections a simple mul will often be faster. ARM/IWRAM changes things a little because the S/I ratio is different and you can combine add/sub with a shift, but it's still true that you shouldn't could muls out so fast.

as long as you're doing the adds the sequential access goes for it too, so his point is perfectly valid.

kusma wrote:

Cearn wrote:

Sorry to be pedantic about this, but this is true for IWRAM, not in general. The proper timing is 1S+mI for MUL and 1S+(m+1)I for MLA. 1S is a sequential access and I an internal access. I always takes 1 cycle, but S is subject to waitstates and takes 1 cycle in IWRAM, 3 in EWRAM or ROM (ignoring waitstate settings for the moment). Thumb shifts/adds/subs are 1S as well, so in the slower sections a simple mul will often be faster. ARM/IWRAM changes things a little because the S/I ratio is different and you can combine add/sub with a shift, but it's still true that you shouldn't could muls out so fast.

as long as you're doing the adds the sequential access goes for it too, so his point is perfectly valid.

What I meant was that the break-even point between MUL and ADD/SUB+LSL depends on the section you're in. Take the worst case scenerio for a MUL: 1S+4I.

Code:

| mul | can be used for x other instructions
ROM | 7 | 7/3 ~ 2
IWRAM | 5 | 5/1 = 5

In ROM, a MUL will be faster if you need more than 2 instructions to compensate, in IWRAM you're allowed up to 5. Mind you, this is for the worst case for MUL. Well, almost;:you still have to load multiplicants to the registers, but looking at the calculation alone, this should be right.

I fully agree that GBA/DS MULs can be quick, but there are nuances that should not be overlooked so easily.

Lost? wrote:

I have a factor to multiply with. And the factor changes. Therefore there is only one way to move the bits in order to get the same result for any factor.

If you need to know what i am using this for, I am using this method for a jump table, and for data access, and for scroll offsets. And I have run out of cycles this time. I have no choice or else my Vblank will not work. So please help me!

I'm not sure I understand ... you're trying to replace multiplications with shifts when both arguments are variable? When one of them is constant, it's easy:

Code:

// y= x*12= x*(4-1)*4
rsb y, x, x, lsl #2
mov y, y, lsl #2

// y= x*14= x*(8-1)*2
rsb y, x, x, lsl #3
mov y, y, lsl #1

// y= x*1152= x*(8+1)*128
add y, x, x, lsl #3
mov y, y, lsl #7

but when both are variable, that would be tricky.

Just for the record, are you already in working in ARM asm and putting it in IWRAM? I'll probably get a "pfft, duh!" thrown at me, but I just want to make sure.

And just for the record, optimizing 1-3 cycles for a multiply call is silly in just about any case you can think of. If you need this optimization, make sure that the terms are ordered such that you get savings that way.

My post was made assuming iwram/arm (or cached inst on the ds). You have to be careful with cycle analysis on paper. It usually doesn't turn out the way one would expect.

In any case, I should have posted this advice instead: You should probably look at your algorithmic complexity before getting down to the cycle. Unless you are doing on the order of a few hundred thousand of these multiplies a second, it probably doesn't make too much of a difference which method you use.

To answer your question, it's as easy as singeling out the bits that are set for the value that you want to multiply with.

For example: 1152 in binary is 10010000000

That means bit 10 and bit 7 is set (counting the least significant bit as bit 0), so you want to shift your variable up twice. First 10 bits and then 7 bits, then add those two shift results together.

But yeah, it's already been pointed out that this is probably not the thing you want to optimize firstly if you want to make your code quicker.
_________________
You can't beat our meat

Touchstone: That method makes an excellent starting point, but it is not optimal. Consider any term 2^x - 1, so all the bits will be set. It's better to just do (a << x) - a, as opposed to (a << 0) + (a << 1) + ... (a << x - 1). For example, multiplying by 255 could be done by a * 256 - a optimally, while the other method requires 8 shifts (8 bits set).

It's actually a pretty interesting algorithmic problem, although I'm sure some compiler genius came up with a great way to do it.

Hmm, extending Touchstone's idea, use the (1<<x) - 1 pattern to get groups of bits at a time:

Code:

a * 0111 0110 (118 decimal)

temp = (a << 3) - a;rsb temp, a, a, lsl #3 ;(a * 0000 0111)
temp += temp << 4 ;add temp,temp,temp,lsl #4;(a * 0111 0111)
a = temp - a ;sub a, temp, a ;(a * 0111 0110)

Or better yet,

Code:

temp = a + (a << 3);add temp, a, a, lsl #3 ;(a * 0000 1001)
a = (a << 7) - temp;rsb a, temp, a, lsl #7 ;(a * 0111 0110)

Still no sure-fire algorithm, but better than straight trial-and-error.

I'll try a big one

Code:

a * 1010 0111 0110 1001 (42857 decimal)

add temp, a, a, lsl #2 (a * 0000 0000 0000 0101)
add tmp2, temp, temp, lsl #3 (a * 0000 0000 0010 1101)
add tmp2, tmp2, temp, lsl #10 (a * 0001 0100 0010 1101)
add tmp2, tmp2, a, lsl #6 (a * 0001 0100 0110 1101)
add tmp2, tmp2, a, lsl #7 (a * 0001 0100 1110 1101)
add a, a, tmp2, lsl #3 (a * 1010 0111 0110 1001)

Ouch :)
_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku

DekuTree64: Yes, that's a quite efficient greedy method. It breaks down when grouping doesn't work well though. I finally broke down and wrote a (way too complex probably :P) searcher for this and it came up with some interesting results.

For example, 45 = 101101 in binary. There is this quite elegant two instruction multiply:
b = a + a << 3
a = b + b << 4

Your 42857 example the best I could find was 6 steps (although different steps).
[edit: actually it found a 5 step path:
42857 = 75753 - 257 << 7
75753 = 8417 + 8417 << 3
8417 = 8481 - 1 << 6
8481 = 257 + 257 << 5
257 = 1 + 1 << 8
]

An interesting one is 43690 (or 0xAAAA, 1010101010101010b). It is solved in 4 steps with:
a = a + a << 8
b = a << 7;
a = b + a << 5;
result = a + a >> 4

The (absolutely a complete mess code):

Code:

#include <vector>
#include <queue>
#include <map>
#include <cstdio>
#include <cstdlib>
using namespace std;

struct node {
node() {}
node(int _num,int _size,int _prev,int _op) : number(_num), size(_size), prev(_prev),op(_op) { }
int number, size, prev, op;
};
bool operator<(const node &n1, const node &n2) {
if (n1.size != n2.size) return n1.size > n2.size;
if (n1.number != n2.number) return n1.number < n2.number;
if (n1.prev != n2.prev) return n1.prev < n2.prev;
return n1.op < n2.op;
}

vector<bool> hit;
map<int, node> cache;
priority_queue<node> q;

int log2int(int num) {
int log2 = 0;
for (;num;log2++) { num &= num - 1; }
return log2;
}
void pushcheck(priority_queue<node> &q, node &n) {
if (n.number < 0) return;
if (n.number >= hit.size()) return;
if (!hit[n.number]) q.push(n);
}
int main() {
const int target = 43690;
int log2target = log2int(target)+1;
q.push(node(1, 1, -1, 0));
hit = vector<bool>(target * 5, false);
while (!q.empty()) {
node top = q.top(); q.pop();
if (top.size > 6) continue;
if (top.number > target * 4) continue;

if (top.number == target) {
// result
printf("Result: \n");
for(;;) {
int data = top.op >> 4;
int tmp = abs(top.number - top.prev);
switch (top.op & 0xf) {
case 1: printf("%d = %d << %d\n", top.number, top.prev, data); break;
case 2: printf("%d = %d + %d << %d\n", top.number, top.prev, tmp >> data, data); break;
case 3: printf("%d = %d - %d << %d\n", top.number, top.prev, tmp >> data, data); break;
case 4: printf("%d = %d + %d >> %d\n", top.number, top.prev, tmp << data, data); break;
case 5: printf("%d = %d - %d >> %d\n", top.number, top.prev, tmp << data, data); break;
}
if (top.prev == -1) break;
top = cache[top.prev];
}
break;
}

if (hit[top.number]) continue;
hit[top.number] = true;
cache[top.number] = top;

node cur = top;
for (int i = 0; i < log2target; i++) {
pushcheck(q,node(top.number << i, top.size + 1, top.number, 1 | (i << 4)));
}
for(;;) {
for (int i = 0; i < log2target; i++) {
pushcheck(q,node(top.number + (cur.number << i), top.size + 1, top.number, 2 | (i << 4)));
pushcheck(q,node(top.number - (cur.number << i), top.size + 1, top.number, 3 | (i << 4)));
}
for (int i = 1,j; (j=(cur.number>>i)) > 0; i++) {
pushcheck(q,node(top.number + j, top.size + 1, top.number, 4 | (i << 4)));
pushcheck(q,node(top.number - j, top.size + 1, top.number, 5 | (i << 4)));
}
if (cur.prev == -1) break;
cur = cache[cur.prev];
}

}
return 0;
}

gladius, I have never seen anything like this before. Your code uses C++ stuff I never new existed.

It looks like it's working. I got a lot of warnings when I compiled it in VC++ 6.

EDIT: Your program goes really fast with numbers up to 511, and it hangs with numbers 512 and higher.

Quote:

Compiling...
mulbit.cpp
D:\Programming\mulbit\mulbit.cpp(24) : warning C4786: 'std::reverse_bidirectional_iterator<std::_Tree<int,std::pair<int const ,node>,std::map<int,node,std::less<int>,std::allocator<node> >::_Kfn,std::less<int>,std::allocator<node> >::iterator,std::p
air<int const ,node>,std::pair<int const ,node> &,std::pair<int const ,node> *,int>' : identifier was truncated to '255' characters in the debug information
D:\Programming\mulbit\mulbit.cpp(24) : warning C4786: 'std::reverse_bidirectional_iterator<std::_Tree<int,std::pair<int const ,node>,std::map<int,node,std::less<int>,std::allocator<node> >::_Kfn,std::less<int>,std::allocator<node> >::const_iterator,
std::pair<int const ,node>,std::pair<int const ,node> const &,std::pair<int const ,node> const *,int>' : identifier was truncated to '255' characters in the debug information
D:\Programming\mulbit\mulbit.cpp(24) : warning C4786: 'std::pair<std::_Tree<int,std::pair<int const ,node>,std::map<int,node,std::less<int>,std::allocator<node> >::_Kfn,std::less<int>,std::allocator<node> >::iterator,std::_Tree<int,std::pair<int con
st ,node>,std::map<int,node,std::less<int>,std::allocator<node> >::_Kfn,std::less<int>,std::allocator<node> >::iterator>' : identifier was truncated to '255' characters in the debug information
D:\Programming\mulbit\mulbit.cpp(24) : warning C4786: 'std::pair<std::_Tree<int,std::pair<int const ,node>,std::map<int,node,std::less<int>,std::allocator<node> >::_Kfn,std::less<int>,std::allocator<node> >::const_iterator,std::_Tree<int,std::pair<i
nt const ,node>,std::map<int,node,std::less<int>,std::allocator<node> >::_Kfn,std::less<int>,std::allocator<node> >::const_iterator>' : identifier was truncated to '255' characters in the debug information
c:\microsoft visual studio\vc98\include\xtree(183) : warning C4786: 'std::_Tree<int,std::pair<int const ,node>,std::map<int,node,std::less<int>,std::allocator<node> >::_Kfn,std::less<int>,std::allocator<node> >::~_Tree<int,std::pair<int const ,node>
,std::map<int,node,std::less<int>,std::allocator<node> >::_Kfn,std::less<int>,std::allocator<node> >' : identifier was truncated to '255' characters in the debug information
c:\microsoft visual studio\vc98\include\xtree(160) : warning C4786: 'std::_Tree<int,std::pair<int const ,node>,std::map<int,node,std::less<int>,std::allocator<node> >::_Kfn,std::less<int>,std::allocator<node> >::_Tree<int,std::pair<int const ,node>,
std::map<int,node,std::less<int>,std::allocator<node> >::_Kfn,std::less<int>,std::allocator<node> >' : identifier was truncated to '255' characters in the debug information
c:\microsoft visual studio\vc98\include\utility(21) : warning C4786: 'std::pair<std::_Tree<int,std::pair<int const ,node>,std::map<int,node,std::less<int>,std::allocator<node> >::_Kfn,std::less<int>,std::allocator<node> >::iterator,bool>::pair<std::
_Tree<int,std::pair<int const ,node>,std::map<int,node,std::less<int>,std::allocator<node> >::_Kfn,std::less<int>,std::allocator<node> >::iterator,bool>' : identifier was truncated to '255' characters in the debug information

mulbit.obj - 0 error(s), 7 warning(s)

_________________
Exceptions are fun

Some of that stuff you've never seen before is probably templates. Simple tutorial here: http://www.cplusplus.com/doc/tutorial/templates.html.

It (ab)uses the STL, so if you aren't familiar with that it would look quite strange.

It's not freezing on numbers higher than 512 here, but my log2int function was completely wrong. It counted the number of set bits.

The correct function looks like this:

Code:

int log2int(int num) {
int log2 = 0;
for (;num;log2++) { num = num & (~(1<<log2)); }
return log2;
}

Using that function I actually improved the old result for 42857, it's now:
42857 = 43537 - 43537 >> 6
43537 = 2561 + 2561 << 4
2561 = 2049 + 1 << 9
2049 = 1 + 1 << 11

It also handles 512 fine now, giving what one would expect:
512 = 1 << 9.

Nice to see some actual code (although I don't really understand how it works :/ )

gladius wrote:

Using that function I actually improved the old result for 42857, it's now:
42857 = 43537 - 43537 >> 6
43537 = 2561 + 2561 << 4
2561 = 2049 + 1 << 9
2049 = 1 + 1 << 11

Hmm, that doesn't quite work, due to the shift-down. Any bits that are shifted off must be 0, otherwise they will still contribute to the result because the other number has been multiplied by them already.
For example, 42857 * 10 should be 428570, but

Code:

10 + 10 << 11 = 10 + 20480 = 20490
20490 + 10 << 9 = 20490 + 5120 = 25610
25610 + 25610 << 4 = 25610 + 409760 = 435370
435370 - 435370 >> 6 = 435370 - 6802 = 428568

_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku

Code:

1125 = 1285 - 1280 >> 3
1285 = 257 + 257 << 2
257 = 1 + 1 << 8

((((x + (x << 8)) + (((x + (x << 8)) << 2)) - (1280 >> 3)) = x * 1125

I like your program and it seems to work. But the constant "- (1280 >> 3)" scares me. Is there any other way to get this number with just bitshifting x? For some reason I want addition only because subtraction is addition of two's complement notation, which doesn't seem like a shortcut to me.
_________________
Exceptions are fun

DekuTree64: That's a great observation, and I didn't even consider it. Those shifts to the right are busted. Back to a 5 step multiplication without that then.

42857 = 43113 - 1 << 8
43113 = 34901 + 2053 << 2
34901 = 2053 + 2053 << 4
2053 = 2049 + 1 << 2
2049 = 1 + 1 << 11

Lost: That 1280 is actually 1285, but the way I coded it, it was tough to reconstruct the number we built from, so it generated (1285 >> 3) << 3 to print, hence chopping of the bottom 3 bits and giving you 1280. As DekuTree64 pointed out, that is not correct.

Using the updated program, the best result is

1125 = 225 + 225 << 2
225 = 257 - 1 << 5
257 = 1 + 1 << 8

So, still 3 steps, and this one actually works :).

This algorithm is basically trying to find the shortest path in a graph, where the graph's nodes are integers, and the connections between the nodes are a + b << n of some sort.

It uses a pretty standard priority queue implementation of a Dijkstra shortest-path search to give O(m logn) performance where m is the number of edges, and n is the number of nodes.

Updated program below (pretty small change):

Code:

#include <vector>
#include <queue>
#include <map>
#include <cstdio>
#include <cstdlib>
using namespace std;

struct node {
node() {}
node(int _num,int _size,int _prev,int _op) : number(_num), size(_size), prev(_prev),op(_op) { }
int number, size, prev, op;
};
bool operator<(const node &n1, const node &n2) {
if (n1.size != n2.size) return n1.size > n2.size;
if (n1.number != n2.number) return n1.number < n2.number;
if (n1.prev != n2.prev) return n1.prev < n2.prev;
return n1.op < n2.op;
}

vector<bool> hit;
map<int, node> cache;
priority_queue<node> q;

int log2int(int num) {
int log2 = 0;
for (;num;log2++) { num = num & (~(1<<log2)); }
return log2;
}
void pushcheck(priority_queue<node> &q, node &n) {
if (n.number < 0) return;
if (n.number >= hit.size()) return;
if (!hit[n.number]) q.push(n);
}
int main() {
const int target = 1125;
int log2target = log2int(target)+1;
q.push(node(1, 1, -1, 0));
hit = vector<bool>(target * 5, false);
while (!q.empty()) {
node top = q.top(); q.pop();
if (top.size > 6) continue;
if (top.number > target * 4) continue;

if (top.number == target) {
// result
printf("Result: \n");
for(;;) {
int data = top.op >> 4;
int tmp = abs(top.number - top.prev);
switch (top.op & 0xf) {
case 1: printf("%d = %d << %d\n", top.number, top.prev, data); break;
case 2: printf("%d = %d + %d << %d\n", top.number, top.prev, tmp >> data, data); break;
case 3: printf("%d = %d - %d << %d\n", top.number, top.prev, tmp >> data, data); break;
case 4: printf("%d = %d + %d >> %d\n", top.number, top.prev, tmp << data, data); break;
case 5: printf("%d = %d - %d >> %d\n", top.number, top.prev, tmp << data, data); break;
}
if (top.prev == -1) break;
top = cache[top.prev];
}
break;
}

if (hit[top.number]) continue;
hit[top.number] = true;
cache[top.number] = top;

node cur = top;
for (int i = 0; i < log2target; i++) {
pushcheck(q,node(top.number << i, top.size + 1, top.number, 1 | (i << 4)));
}
for(;;) {
for (int i = 0; i < log2target; i++) {
pushcheck(q,node(top.number + (cur.number << i), top.size + 1, top.number, 2 | (i << 4)));
pushcheck(q,node(top.number - (cur.number << i), top.size + 1, top.number, 3 | (i << 4)));
}
for (int i = 1,j; (j=(cur.number>>i)) > 0; i++) {
if ((j << i) != cur.number) break;
pushcheck(q,node(top.number + j, top.size + 1, top.number, 4 | (i << 4)));
pushcheck(q,node(top.number - j, top.size + 1, top.number, 5 | (i << 4)));
}
if (cur.prev == -1) break;
cur = cache[cur.prev];
}

}
return 0;
}

gbadev.org forum archive

Coding > Help making up multiplication with bitwise shifts!

#63441 - LOst? - Tue Dec 13, 2005 4:57 am

#63444 - caitsith2 - Tue Dec 13, 2005 5:19 am

#63445 - LOst? - Tue Dec 13, 2005 5:28 am

#63457 - gladius - Tue Dec 13, 2005 8:28 am

#63461 - Cearn - Tue Dec 13, 2005 10:16 am

#63464 - LOst? - Tue Dec 13, 2005 10:47 am

#63470 - kusma - Tue Dec 13, 2005 12:04 pm

#63474 - Cearn - Tue Dec 13, 2005 1:43 pm

#63502 - Miked0801 - Tue Dec 13, 2005 8:25 pm

#63505 - gladius - Tue Dec 13, 2005 8:37 pm

#63531 - Touchstone - Wed Dec 14, 2005 1:36 am

#63559 - gladius - Wed Dec 14, 2005 7:32 am

#63563 - DekuTree64 - Wed Dec 14, 2005 9:01 am

#63570 - gladius - Wed Dec 14, 2005 11:23 am

#63766 - LOst? - Fri Dec 16, 2005 4:54 am

#63773 - gauauu - Fri Dec 16, 2005 5:43 am

#63775 - gladius - Fri Dec 16, 2005 6:25 am

#63786 - DekuTree64 - Fri Dec 16, 2005 7:57 am

#63787 - LOst? - Fri Dec 16, 2005 8:06 am

#63799 - gladius - Fri Dec 16, 2005 11:18 am