gbadev.org forum archive

I'm writing a GBA game in simple C++ with devkitpro. After a great deal of effort, I've gotten a simple demo of the first room working and playable. The problem is that it is way to slow. As far as I can tell, my update function takes 1-2 frames to update, meaning that it only runs at half speed. So I need to at least double the speed in order to get it to run normally.

So now I'm trying to figure out how to optimize it. Unfortunately, I'm a beginner and have no idea where to start. One possible step I discovered is attempt to put more things in IWRAM. Unfortunately, I know nothing about linker scripts and I'm not sure how to find out what things are currently placed where, let alone how to optimize the section placement.

By the way, no I don't use exceptions, or rtti, or floating point, or virtual functions, or dynamic memory, or the standard library. So there aren't any really obvious speedups like that.

Also, I know that my physics code is the main bottleneck, so I don't need to worry about graphics (and I use DMA for that anyway). When I comment out parts of the physics code, my game runs at full speed.

Also, here are my current compile options
-g -Wall -O2 -mcpu=arm7tdmi -mtune=arm7tdmi -fomit-frame-pointer -ffast-math -mthumb -mthumb-interwork -fno-rtti -fno-exceptions -std=c++0x

It is not possible to offer advice without more information. code would be best, or at a minimum a description of the algorithm you are using.

What elhobbs said is true, but also consider that it isn't necessarily bad to actually run at 30 fps instead of 60 -- miked says they often do that in commercial games, even.

If you can run at a solid 30, and make everything move twice as fast as what you had before, it's often "good enough" and gives you tons more time for processing each frame.

The problem is that I'm not sure what code to post.

Here's my collision detection code. It's probably one of the most performance hogging parts of the game.

Fixed<E>, Vec<E>, and UnitVec are part of the fixed point library I wrote. Fixed is a single fixed point number (32bit), where E is the shift. So Fixed<20> represents a 11.20 number, which is used for all coordinates in my game. Vec is a 2d vector. UnitVec is a special class for axis aligned unit vectors.
The named functions (Add, Mult, Div, etc.) use a 64bit intermediate value for better accuracy. The operators (+, -) just call the corresponding operator on the values directly without worrying about rounding or overflow.

Code:

#pragma once
#include "array.h"

template <class T, class DirType> //Should be shape types
Fixed<20> PenetrationAlongDir(const DirType& N, const Fixed<20> baseN, const Vec<20>& centerOffset, const T& shapeB)
{
auto minB = Fixed<20>::MAX();

for(u32 i=0; i<4; ++i)
{
auto newB = Dot<20>(N, centerOffset + shapeB.GetPoint(i));
minB = Min(minB, newB);
}

return baseN - minB;
}

template <class T1, class T2, class DirType>
void GetPenetrationSAT(DirType& bestNorm, Fixed<20>& bestPen, const Vec<20>& dist /*centerOffset*/, const T1& shapeA, const T2& shapeB)
{
bestPen = Fixed<20>::MAX();

for(u32 i=0; i<4; ++i)
{
const auto N = shapeB.GetSidePerp(i);
const auto baseN = Dot<20>(shapeB.GetPoint(i), N);
const auto newPen = PenetrationAlongDir(N, baseN, -dist, shapeA);

if (newPen < bestPen)
{
bestPen = newPen;
bestNorm = N;
if (newPen.val < 0) {return;}
}
}

for(u32 i=0; i<4; ++i)
{
const auto N = shapeA.GetSidePerp(i);
const auto baseN = Dot<20>(shapeA.GetPoint(i), N);
const auto newPen = PenetrationAlongDir(N, baseN, dist, shapeB);

if (newPen < bestPen)
{
bestPen = newPen;
bestNorm = -N;
if (newPen.val < 0) {return;}
}
}
}

//Will return pt2 if it is exactly on the line, but not pt1
template <class DirType>
bool GetLineIntersection(Vec<20>& rval, const DirType& N, const Fixed<20>& baseN, const Vec<20>& pt1, const Vec<20>& pt2)
{
const auto lastN = Dot<20>(pt1, N);
const auto nextN = Dot<20>(pt2, N);

//Point is exactly on side
if (nextN == baseN)
{
rval = pt2;
return true;
}
else if (((nextN>baseN) && (lastN<baseN)) || ((nextN<baseN) && (lastN>baseN)))
{
//Clip the two sides together to get intersection point. We use a waited sum of the points on either side
const auto W2 = Div<30>(baseN - nextN, lastN - nextN);
const auto W1 = Fixed<30>(1073741824 /*1.0*/) - W2;
assert(W2.val >= 0 && W1.val >= 0);

rval = VecMult<20>(pt2, W1) + VecMult<20>(pt1, W2);
return true;
}

return false;
}

template <class T1, class T2> //Should be shape types
bool FindContactPoints(Array<Vec<20>, 2>& Points, const Vec<30>& N, const Vec<20>& centerOffset, const T1& shapeA, const T2& shapeB)
{
//Figure out what side of a is being intersected. If none match the norm, return false (failure)
int aside = -1;
for(u32 i=0; i<4; ++i)
{
if (static_cast<decltype(N)>(shapeA.GetSidePerp(i)) == N)
{
aside = i; break;
}
}

if (aside < 0) {return false;}

//Array<Vec<20>, 2> Points;
u32 pi = 0;

const auto baseN = Dot<20>(shapeA.GetPoint(aside), N);

for(u32 i=0; i<4 && pi < 2; ++i)
{
//If there's an intersection, it will store in first argument and return true
if (GetLineIntersection(Points[pi], N, baseN, centerOffset + shapeB.GetPoint(i-1), centerOffset + shapeB.GetPoint(i)))
{ ++pi; }
}

//If no points were found return failure
if (pi <= 0) {return false;}

//Now make sure that both points are actually on the side, i.e. clip them to the corners if not
const auto baseT1 = Dot<20>(shapeA.GetPoint(aside), Rot90(N));
const auto baseT2 = Dot<20>(shapeA.GetPoint(aside+1), Rot90(N));
assert(baseT2 < baseT1);

for(u32 p=0; p<pi; ++p)
{
const auto newT = Dot<20>(Points[p], Rot90(N));

if (newT >= baseT1) {Points[p] = shapeA.GetPoint(aside);}
else if (newT <= baseT2) {Points[p] = shapeA.GetPoint(aside+1);}
}

//It is possible to only find one point if the shape has a corner just barely touching the other. In this case, just double it for now
//And the calling function will deal with it
if (pi == 1) {Points[pi++] = Points[0];}
assert(pi == 2);
return true;
}

//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
template <class T2> //Should be shape types
void ShapeCollisionTest(ContactList& contacts, CubeData* bodyA, PosVel* bodyB, Vec<20> offsetB, const T2& shapeB, eMassType massB)
{
assert(bodyA && bodyB);
const auto& shapeA = bodyA->shape();
const auto dist = (bodyB->pos + offsetB) - bodyA->pos;

Vec<30> bestNorm;
auto bestPen = Fixed<20>::MAX();
GetPenetrationSAT(bestNorm, bestPen, dist, shapeA, shapeB);
if (bestPen.val < 0) {return;}

Array<Vec<20>, 2> Points; //Contact points
bool done = false;
done = FindContactPoints(Points, -bestNorm, dist, shapeA, shapeB);

if (!done) //Try cutting against the sides of shape B if shape A didn't work
{
done = FindContactPoints(Points, bestNorm, -dist, shapeB, shapeA);

if (!done)
{
PRINT("!!!!!!!!!!!!!!!!!!!!!!!!!");
PRINT(bestNorm);
PRINT(dist);
PRINT(shapeA.GetPoint(0));
PRINT(shapeB.GetPoint(0));
PRINT(cube.dir);
}

assert(done);

Points[0] += dist; //FCP returns points in the reference frame of the first shape argument.
Points[1] += dist; //Since we passed b first, we have to move them back to a
}

contacts.addCubeContact(bodyA, bodyB, massB, bestNorm, -Points[0], -Points[1], bestPen);
}

template <class T2> //Should be shape types
void ShapeCollisionTest(ContactList& contacts, PickoryData* bodyA, PosVel* bodyB, Vec<20> offsetB, const T2& shapeB, eMassType massB)
{
assert(bodyA && bodyB);
const auto& shapeA = bodyA->shape();
const auto dist = (bodyB->pos + offsetB) - bodyA->pos;

//Vec<30> bestNorm;
UnitVec bestNorm;
auto bestPen = Fixed<20>::MAX();
GetPenetrationSAT(bestNorm, bestPen, dist, shapeA, shapeB);
if (bestPen.val < 0) {return;}

contacts.addContact(bodyA, bodyB, massB, bestNorm, bestPen);
}

template <class T2>
void DoShapeCollisionTests(ContactList& contacts, CubeData* bodyA1, PickoryData* bodyA2, PosVel* bodyB, Vec<20> offsetB, const T2& shapeB, eMassType massB)
{
ShapeCollisionTest(contacts, bodyA1, bodyB, offsetB, shapeB, massB);
ShapeCollisionTest(contacts, bodyA2, bodyB, offsetB, shapeB, massB);
}

///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

void LineCollisionTest(ContactList& contacts, CubeData* bodyA, const Surface& s)
{
PosVel* const bodyB = &dummy;
assert(bodyA);

const auto& shapeA = bodyA->shape();
const auto T = s.dir;
const auto N = -Rot90(s.dir);
assert(T == Rot90(N));

const auto BBsize = Fixed<20>(22985121 /*21.92032*/); // Half width of a bounding square for the cube

const auto surfN = Dot<20>(s.pos, N);
const auto cubeN = Dot<20>(bodyA->pos, N);

//Bounding box check
if (cubeN >= surfN + BBsize) {return;}
if (cubeN < surfN) {return;} //If cube is already more than half way in, there's no real hope of resolving the collision, so just bail

const auto surfT1 = Dot<20>(s.pos, T);
const auto surfT2 = surfT1 + s.length;
const auto cubeT = Dot<20>(bodyA->pos, T);

//Bounding box check
if (cubeT <= surfT1 - BBsize) {return;}
if (cubeT >= surfT2 + BBsize) {return;}

auto bestPen = -Fixed<20>::MAX(); //Best penetration
Array<Vec<20>, 2> Points; //Contact points
u32 pi = 0;

//Clip each side of the cube to get the contact points and calculate penetration while we're at it
for(u32 i=0; i<4 && pi < 2; ++i)
{
const auto nextN = cubeN + Dot<20>(shapeA.GetPoint(i), N);
bestPen = Max(bestPen, surfN - nextN);

//If there's an intersection, it will store in first argument and return true
if (GetLineIntersection(Points[pi], N, surfN, bodyA->pos + shapeA.GetPoint(i-1), bodyA->pos + shapeA.GetPoint(i)))
{ ++pi; }
}

//Check if penetration is nonnegative
if(bestPen.val < 0) {return;}

if (pi == 1) {Points[pi++] = Points[0];}
assert(pi == 2);

const auto ptT1 = Dot<20>(Points[0], T);
const auto ptT2 = Dot<20>(Points[1], T);

//Check if the collision actually occured within the bounds of the line segment
if ((ptT1 <= surfT1 && ptT2 <= surfT1) || (ptT1 >= surfT2 && ptT2 >= surfT2)) {return;}

for(int pi=0; pi<2; ++pi)
{
const auto temp = Dot<20>(Points[pi], T);

if (temp <= surfT1) {Points[pi] = s.pos;}
else if (temp >= surfT2) {Points[pi] = s.pos + VecMult<20>(s.dir, s.length);}
}

contacts.addCubeContact(bodyA, bodyB, mtSTATIC, N, bodyA->pos - Points[0], bodyA->pos - Points[1], bestPen);
}

void LineCollisionTest(ContactList& contacts, PickoryData* bodyA, const Surface& s)
{
PosVel* const bodyB = &dummy;
assert(bodyA);

const auto& shapeA = bodyA->shape();
const auto T = s.dir;
const auto N = -Rot90(s.dir);

//calculate bounding box size along N and T
const auto boxN = Abs(shapeA.GetPoint(0) * N);
const auto boxT = Abs(shapeA.GetPoint(0) * T);

const auto surfN = Dot<20>(s.pos, N);
const auto kidN = Dot<20>(bodyA->pos, N);

//Bounding box check
if (kidN > surfN + boxN) {return;}
if (kidN < surfN) {return;} //If kid is already more than half way in, there's no real hope of resolving the collision, so just bail

const auto surfT1 = Dot<20>(s.pos, T);
const auto surfT2 = surfT1 + s.length;
const auto kidT = Dot<20>(bodyA->pos, T);

//Bounding box check along T
if (kidT <= surfT1 - boxT) {return;}
if (kidT >= surfT2 + boxT) {return;}

const auto bestPen = surfN - (kidN - boxN);
contacts.addContact(bodyA, bodyB, mtSTATIC, N, bestPen);
}

template <class T1> //should be CubeData or PickoryData
bool EdgeChainCollisionTest(ContactList& contacts, T1* bodyA, const Surface& s1, const Surface& s2)
{
assert(s2.pos == s1.end());
const auto& shapeA = bodyA->shape();

//Check if corner is convex
if (Rot90(s1.dir) == s2.dir)
{
//Check if corner is inside cube
if (shapeA.Contains(s2.pos - bodyA->pos))
{
const auto offset = half(s2.end() - s1.pos);
const auto center = s1.pos + offset;
const Rect r(offset);
ShapeCollisionTest(contacts, bodyA, &dummy, center, r, mtSTATIC);
return true;
}
}

//If we reach this point, we didn't do a corner check
//LineCollisionTest(contacts, bodyA, s1);

if (portals.Linked() && s1.portal)
{
ArrayList<interval, 4> intvs;

//Calculate pieces
for(u32 ptli=0; ptli<2; ++ptli)
{
if (portals[ptli].surf == &s1)
{
intvs.add(portals[ptli].left);
intvs.add(portals[ptli].right);
}
}

for(u32 ivi = 0; ivi<intvs.size(); ++ivi)
{
const auto& iv = intvs[ivi];
//if (iv.s2 == iv.s1) {continue;} //skip intervals with 0 length

const auto offset = VecMult<20>(s1.dir, half(iv.s2 - iv.s1)) + VecMult<20>(Rot90(s1.dir), Fixed<20>(16777216 /*16.0*/));
const auto center = s1.pos + offset + VecMult<20>(s1.dir, iv.s1);

const Rect r(offset);
ShapeCollisionTest(contacts, bodyA, &dummy, center, r, mtSTATIC);
}

if (intvs.size() > 0) {return false;}
}

LineCollisionTest(contacts, bodyA, s1);
return false;
}

Yiminy! You know that you don't need to use the "auto" keyword, right? It's implicit.

No I'm using the C++0x auto keyword for automatic type deduction.
It makes the code more generic and easier to change.

I think you are going to need to do a little profiling to see where your time is spent. nothing really exists for the ds/gba platform. there are hardware timers that you can use - although you will have to manually insert time sampling into your code.

Do you have any guides to timing? I haven't used timers before. (Or interrupts for that matter)

Also, how do you make sure that all the code and data is placed in the right section?

Do you have access to any sort of profiler? Without that, you are indeed guessing on where the slowdown is. Perhaps even an intrusive, timer based callback that stores and counts stack locations (a profiler) you could write yourself.

Looking at the code you uploaded:
Everytime I see newT, I jump and think you are allocating memory. :)
pi as a local also threw me off.
What exactly does Array<> do? Is this allocating heap or stack memory and how heavy is it? In general, dynamic memory allocations tend to get pretty heavy on performance if done in tight loops.

I'd also venture to say that this is a pretty heavy duty collision detection routine for a GBA, but if you aren't doing too much else, it's probably fine.

Overall, this is very well constructed code. Easy to follow and well asserted. :)

I see no immediately obvious things that are "slow" here outside of the possibility of dynamic memory allocations for the Array and ArrayList stuff. You minimize divides, don't sqrt, don't float, and do your work quickly. Have you considered moving some of this code to ITCM and/or making it ARM/Thumb to see if it helps? I'd also like to confirm your stack is it DTCM. Finally, do you have other code running like sound/music?

I made a modified version of VisualBoyAdavnce that does simple profiling (counts the number of times each memory address was executed). For ARM instructions only.

I use it with some other programs I made that put it against a disassembly.

I usually found that inner loops account for by far the most time of any program.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

I don't have any sound or music code. I figured that is something I can worry about after I get the rest of the game working.

Here's my array classes. I don't do any dynamic memory allocations.

The ArrayList is basically just a plain array that also keeps track of how many elements have been stored. I had troubling deciding what to name it but I eventually decided ArrayList would be the least confusing name.

Code:

#pragma once
#include "ints.h"
#include "macros.h"

template <typename T, u32 N>
class Array
{
T arr[N];

public:

T& operator[](u32 i) {assert(i < size()); return arr[i];}
const T& operator[](u32 i) const {assert(i < size()); return arr[i];}
static u32 size() {return N;}
};

template <typename T, u32 N>
class ArrayList
{
T arr[N];
u32 m_size;

public:
ArrayList() : m_size(0) {}

T& operator[](u32 i) {assert(i < size()); return arr[i];}
const T& operator[](u32 i) const {assert(i < size()); return arr[i];}

u32 size() const {return m_size;}
u32 capacity() const {return N;}

void clear() {m_size = 0;}

void add(const T& val)
{
assert(m_size < N);
if (m_size < N) {arr[m_size++] = val;}
}

typedef T* iterator;
typedef const T* const_iterator;
const T* begin() const {return arr;}
const T* end() const {return arr+size();}
T* begin() {return arr;}
T* end() {return arr+size();}

//These functions are useful for adding elements in place
//T& next() {assert(m_size < N); return arr[m_size];}
//void sizeinc() {m_size++;}
};

template <typename T>
struct ArrayConstPtr
{
const T* arr;
u32 N;

//T& operator[](u32 i) {assert(i < size()); return arr[i];}
const T& operator[](u32 i) const {assert(i < size()); return arr[i];}
u32 size() const {return N;}

typedef const T* iterator;
typedef const T* const_iterator;
const T* begin() const {return arr;}
const T* end() const {return arr+size();}
};

template <typename T> auto begin(T& container) -> decltype(container.begin()) {return container.begin();}
template <typename T> auto end(T& container) -> decltype(container.end()) {return container.end();}
template <typename T> auto begin(const T& container) -> decltype(container.begin()) {return container.begin();}
template <typename T> auto end(const T& container) -> decltype(container.end()) {return container.end();}

You do have optimizations turned on, right?
You could try defining NDEBUG to remove all asserts from the generated code.

Otherwise, find out what's executed the most (with a profiler), build it as ARM code and move it to IWRAM.

Also, look at the disassembly of your code to see how good it is.

If you send me the .ELF file, I'll try to profile it.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

Yes, I have O2 on. I already posted my compiler flags in the first post.
And the assert is already just defined as nothing. I have no idea how to code a useful assert statement for the GBA anyway.

As far as looking at the assembly, I don't know how to do that. Do you have any guides? Also, how do I send you the elf?

Code:

#pragma once

#define STATICASSERT(x) static_assert((x), #x)

#ifdef TESTING
#include <cassert>
#define PRINT(x) std::cout << #x ": " << (x) << '\n'
#else
#define PRINT(x)
#define assert(x)
#endif

Easy way to do disassembly:
arm-eabi-objdump -d filename.elf > outputfile.s

If you are using the example makefiles, you get an ELF file before it makes a GBA file. The ELF file is basically the same as the GBA file since the code is already relocated to its final addresses, but it still has the symbols intact.

Right now, my profiling tools are very limited, you need to build it as ARM mode instead of thumb mode, but they still work okay.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

I have the elf file, I was just wondering how to send it to you.

Edit: I figured out that dekitpro was making a multiboot image by default. I tried changing it, but there is no visible effect on speed.

Last edited by GLaDOS on Fri Jul 22, 2011 8:08 pm; edited 2 times in total

Just post it on my board somewhere, I allow guests to upload files.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

I tried to post it but the attachment won't show up.

Sorry, forgot to turn off the 100k size limit... I increased it to 10MB.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

I added a rudimentary by recording the scanline counter at certain points and drawing a sprite at that location. It's crude but at least it's better than nothing.

I found some divisions in my contact solving code that seem unnecessary. I tried to replace the 3 divisions with 1 division followed by multiplication by the reciprocal. There was a large speedup, but unfortunately, the physics became very unstable.
Presumably there were large rounding errors introduced by taking the reciprocal, though I'm not sure why. Even in the worse case scenario, it should have a relative error of at most 1/139665.

Code:

//Construct inverse matrix
const auto cc = Mult<20>(c, c); // max 1922, min .25
const auto det = Mult<19>(cc, invM); // 1922*2 to .25
const auto iim = Mult<24>(I, invM); // 40 to 80

//Multiplying by reciprocal is faster than division
/*const auto invDet = Div<28>( Fixed<30>(1073741824), det );

const auto AInv11 = Mult<19>( Add<21>(iim, bb), invDet);
const auto AInv12 = Mult<19>( Add<21>(iim, ab), invDet);
const auto AInv21 = AInv12;
const auto AInv22 = Mult<19>( Add<21>(iim, aa), invDet);*/

const auto AInv11 = Div<19>( Add<21>(iim, bb), det); // max 2242
const auto AInv12 = -Div<19>( Add<21>(iim, ab), det);
const auto AInv21 = AInv12;
const auto AInv22 = Div<19>( Add<21>(iim, aa), det);

I can see that you are using the "Noob method" of waiting for vblank, one that repeatedly checks VCOUNT inside a while loop. It sticks out like a sore thumb in the profile. You should replace it with a call to VBlankIntrWait(), which halts the processor until an interrupt happens, and uses less power.

Looks like it spends lots of time in __aeabi_lmul and __aeabi_uidiv, which are slow ways of doing multiplication and division.

Normally multiplying two 32 bit numbers into a 64 bit number is a single ARM instruction that takes 7 cycles to complete, not a call to a 36 instruction long THUMB multiply function. If you can, find a way to eliminate 64 bit x 64 bit multiplication, and get it down to 32 bit x 32 bit.
Also, if you are dividing by the same number a lot, find out its reciprocal, then multiply by it instead. So you take 0x100000000 (a 64 bit number), and do the division using some slower division code. You get the reciprocal. You can then use 32x32 to 64 bit multiplication to divide, just by discarding the low 32 bits of your result.

I suggest you try to avoid using THUMB mode, since it just makes horrible choices of how to do multiplication and division.

Also, the program fails to run on NO$GBA, but runs fine on VisualBoyAdvance.

If you still can't get the compiler to generate good code for multiplying, you can use inline assembly to force it to use the MULL/UMULL instruction.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

Wait, so if I have a line like
s64 p = (s64)a.val * (s64)b.val;
should I just get rid of the casts?

Also, I already got rid of several divisions since I posted that elf. The remaining ones are either impossible to get rid of or cause bugs when removed (I'm looking into that at the moment.)

And sorry about the noob thing. I'm not very experienced with GBA coding. I've never even used an interrupt before.
By the way, do you know why it doesn't run on No$GBA? I use VBa so I didn't notice.

EDIT: I tried getting rid of the s64 casts in the multiplication function, and everything completely broke, presumably due to overflow. If I get rid of only one of the casts, then it works like normal but there is also no speed increase.

Build in ARM mode, not THUMB mode.

This little piece of code here:

Code:

s32 test(s32 a, s32 b)
{
s64 value = (s64)a * (s64)b;
return value >> 32;
}

In ARM mode, it builds to this:

Code:

test:
str r4, [sp, #-4]!
smull r3, r4, r1, r0
mov r0, r4
ldmfd sp!, {r4}
bx lr

Not too bad, but the compiler could have done this better, eliminate the mov, and eliminate the stack push/pop.

In THUMB mode, it builds to this:

Code:

test:
push {r3, lr}
mov r2, r0
asr r3, r2, #31
mov r0, r1
asr r1, r1, #31
bl __aeabi_lmul
@ sp needed for prologue
mov r0, r1
pop {r3}
pop {r1}
bx r1

So you want to avoid THUMB mode for anything that does any multiplication, otherwise it calls some slow multiply function to actually do the work, instead of simply using the SMULL instruction.

Using reciprocals for signed division is a tad problematic, you should work with absolute values, and set the sign at the end. Otherwise you could end up with problems such as negative numbers off by 1.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

I changed the options from -mthumb -mthumb-interwork to -marm and it's now much faster. Thanks!

It now lags only in complicated situations.

Ah the beuaty of profiling, showing exactly where one needs to improve things :)

I figured out why the game broke when I change the division with multiplication by reciprocal.

It turns out to be a really stupid mistake. I accidentally got rid of a minus sign while changing the divisions to multiplications.

GLaDOS wrote:

No I'm using the C++0x auto keyword for automatic type deduction.
It makes the code more generic and easier to change.

Oh yes! They've re-purposed "auto" to work like C#'s "var".

In fact, you can just "#define var auto" if you really like the var keyword.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

Also note that VBA tends to have worse timings than the real thing. If you run the game on hardware you might find that it's faster.

Most people will probably be playing it on an emulator though since I don't exactly have the resources to manufacture cartridges.

Update: Once again, my game is running too slow. I think one possible target for optimization is my portal spark code.
It is intended to create an animated spark effect coming out of portals when they are linked. Each spark has a random start position, velocity, and lifetime.

Code:

class Xorshift32
{
u32 val;

public:
Xorshift32(u32 seed=0x92D68CA2): val(seed) {}

u32 GetNext()
{
val ^=(val<<13);
val ^=(val>>17);
val ^=(val<<5);

assert(val != 0);
return val;
}

//GetNext always returns nonzero. Shift so it is never intmin
s32 GetSigned() { return (GetNext() + 0x80000000u); }
};

struct SparkData{
Vec<20> pos;
Vec<20> vel;
u32 time;
};

Array<ArrayList<SparkData, 16>, 2> sparkdata;
Xorshift32 prng;

void updateSparks(u32 pi)
{
auto& data = sparkdata[pi];

for(u32 i = 0; i < data.size();)
{
if (data[i].time)
{
data[i].pos += data[i].vel;
--data[i].time;
++i;
}
else {data.swapremove(i);} //Don't increment i if we're removing an element or else we'll skip one
}

if (portals.Linked() && data.size() < data.capacity())
{
//CreateSpark(myprng.GetRange(-HALFWIDTH(),HALFWIDTH()), -2, myprng.GetRange(.7,1.3), 5 + myprng.GetNext() % 23);
const auto px = prng.GetSigned() >> 6; // -32.0 to 32.0
const auto vy = 393216 /*0.375*/ + (s32)(prng.GetNext() >> 14); //0.375 to 0.625

//generate number from 0 to 44 without using modulo. 0 - 18 are slightly more likely
//empirically, it doesn't appear to make any difference in speed though, probably because it only runs once or twice per tick
u32 t = prng.GetNext() >> 26;
if (t >= 45) {t -= 45;}

SparkData s;
s.pos = portals[pi].ToWorld( Vec<20>(px, 0) );
s.vel = portals[pi].Transform( Vec<20>(0, vy) );
s.time = 10u + t;
data.add(s);
}
}

Break out the profiler again and see what still needs optimization?
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

How do I profile it?

Tools needed:
DevKitArm (for arm-eabi-objdump)
VBA_TEH_PROFILER.EXE
profiler_merge.exe
TextPad Text Editor (because it has a Sort feature)

Start with your ELF file. Disassemble it to ASM:

arm-eabi-objdump -D portal.elf > portal.s

Run your GBA file in a special version of VBA that I built, called "VBA_TEH_PROFILER". Run it for a short time, 2 minutes is plenty, making sure to cover all the code you're interested in checking. Then save the profile log using "File > Save Profile Data".

Run "profiler_merge" to merge the instruction execution count with the disassembly.

profiler_merge portal.s portal.profile portal_output.s

Now you have some huge file that looks like this:

Code:

00001E30 08000304 <_Z17NeedToChangeBlocki>:
00001E30 8000304: b570 push {r4, r5, r6, lr}
00001E30 8000306: 4b14 ldr r3, [pc, #80] ; (8000358 <_Z17NeedToChangeBlocki+0x54>)
00001E30 8000308: 681a ldr r2, [r3, #0]
00001E30 800030a: 2001 movs r0, #1
00001E30 800030c: 2a00 cmp r2, #0

... snipped ...

00001E26 8000348: 428d cmp r5, r1
00001E26 800034a: db00 blt.n 800034e <_Z17NeedToChangeBlocki+0x4a>
00001E26 800034c: 2000 movs r0, #0
00001E26 800034e: 0600 lsls r0, r0, #24
00001E2B 8000350: 0e00 lsrs r0, r0, #24
00001E2B 8000352: bc70 pop {r4, r5, r6}
00001E30 8000354: bc02 pop {r1}
00001E30 8000356: 4708 bx r1

Left column is the number of times that instruction was executed. (it's approximate for THUMB instructions though, it counts both halves of the word as the same address, but it's accurate for ARM instructions)
Middle column is the address in ROM of the code.
Rest of the line is the disassembly.

Now get TextPad, a great text editor. It can sort the file. If you sort the text by column 1 in descending order, it will put all the most executed code at the top of the file. Of course, everything will be in semi-random order after a sort, so you can search the original file for the address if you want to see the code near there.

You usually see that inner loops are much hotter than their surrounding code, and they tend to need optimization the most.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

Here's the first few lines of the results. I'm guessing that the first 3 are the busy waiting part of the game loop (haven't tried interrupts yet). But that won't affect speed for obvious reasons.

Code:

03072536 80071bc: 9afffffc bls 80071b4 <main+0xe4>
03072536 80071b8: e353009f cmp r3, #159 ; 0x9f
03072536 80071b4: e1d430b6 ldrh r3, [r4, #6]
00D020CE 80071b0: 8afffffc bhi 80071a8 <main+0xd8>
00D020CE 80071ac: e353009f cmp r3, #159 ; 0x9f
00D020CE 80071a8: e1d430b6 ldrh r3, [r4, #6]
000E3494 800048c: 1affffe4 bne 8000424 <_Z13SurfaceMergerR7SurfaceRPKS_S2_RK13PortalManagerb+0x60>
000E3494 8000488: e1520006 cmp r2, r6
000E3494 8000484: e5932008 ldr r2, [r3, #8]
000A426B 8000428: 0a00001e beq 80004a8 <_Z13SurfaceMergerR7SurfaceRPKS_S2_RK13PortalManagerb+0xe4>
000A426B 8000424: e1520005 cmp r2, r5
0008DB54 8006008: 1afffffb bne 8005ffc <memcpy+0xc0>
0008DB54 8006004: e4c3c001 strb ip, [r3], #1
0008DB54 8006000: e2522001 subs r2, r2, #1
0008DB54 8005ffc: e4d1c001 ldrb ip, [r1], #1
0007A0F9 8000434: 0a000009 beq 8000460 <_Z13SurfaceMergerR7SurfaceRPKS_S2_RK13PortalManagerb+0x9c>
0007A0F9 8000430: e3510000 cmp r1, #0
0007A0F9 800042c: e5971000 ldr r1, [r7]
00078766 8000480: 0a000015 beq 80004dc <_Z13SurfaceMergerR7SurfaceRPKS_S2_RK13PortalManagerb+0x118>
00078766 800047c: e58a3000 str r3, [sl]
00078766 8000478: e1a04003 mov r4, r3
00078766 8000474: e58c2010 str r2, [ip, #16]
00078766 8000470: e1580003 cmp r8, r3
00078766 800046c: e0812002 add r2, r1, r2
00078766 8000468: e2843018 add r3, r4, #24
00078766 8000464: e59c1010 ldr r1, [ip, #16]
00078766 8000460: e5932010 ldr r2, [r3, #16]
0006F085 080003c4 <_Z13SurfaceMergerR7SurfaceRPKS_S2_RK13PortalManagerb>:
0006F085 80004a4: e12fff1e bx lr
0006F085 80004a0: e8bd0ff0 pop {r4, r5, r6, r7, r8, r9, sl, fp}
0006F085 8000420: ea000017 b 8000484 <_Z13SurfaceMergerR7SurfaceRPKS_S2_RK13PortalManagerb+0xc0>
0006F085 800041c: e269b000 rsb fp, r9, #0
0006F085 8000418: e2656000 rsb r6, r5, #0
0006F085 8000414: e59c9008 ldr r9, [ip, #8]
0006F085 8000410: e59c500c ldr r5, [ip, #12]
0006F085 800040c: 0a000032 beq 80004dc <_Z13SurfaceMergerR7SurfaceRPKS_S2_RK13PortalManagerb+0x118>
0006F085 8000408: e1a03004 mov r3, r4
0006F085 8000404: e1580004 cmp r8, r4
0006F085 8000400: 158c3010 strne r3, [ip, #16]
0006F085 80003fc: 18850003 stmne r5, {r0, r1}
0006F085 80003f8: 13a03000 movne r3, #0
0006F085 80003f4: 18960003 ldmne r6, {r0, r1}
0006F085 80003f0: 18a5000f stmiane r5!, {r0, r1, r2, r3}
0006F085 80003ec: 11a0500c movne r5, ip
0006F085 80003e8: 18b6000f ldmne r6!, {r0, r1, r2, r3}
0006F085 80003e4: e1a08002 mov r8, r2
0006F085 80003e0: e1a0a001 mov sl, r1
0006F085 80003dc: e1a07003 mov r7, r3
0006F085 80003d8: e1a0c000 mov ip, r0
0006F085 80003d4: 11a06004 movne r6, r4
0006F085 80003d0: e35c0000 cmp ip, #0
0006F085 80003cc: e5914000 ldr r4, [r1]
0006F085 80003c8: e5ddc020 ldrb ip, [sp, #32]
0006F085 80003c4: e92d0ff0 push {r4, r5, r6, r7, r8, r9, sl, fp}
000671DA 8005de0: 1affffef bne 8005da4 <__aeabi_uidiv+0x58>
000671DA 8005ddc: 11a01221 lsrne r1, r1, #4
000671DA 8005dd8: 11b03223 lsrsne r3, r3, #4
000671DA 8005dd4: e3500000 cmp r0, #0
000671DA 8005dd0: 218221a3 orrcs r2, r2, r3, lsr #3
000671DA 8005dcc: 204001a1 subcs r0, r0, r1, lsr #3
000671DA 8005dc8: e15001a1 cmp r0, r1, lsr #3
000671DA 8005dc4: 21822123 orrcs r2, r2, r3, lsr #2
000671DA 8005dc0: 20400121 subcs r0, r0, r1, lsr #2
000671DA 8005dbc: e1500121 cmp r0, r1, lsr #2
000671DA 8005db8: 218220a3 orrcs r2, r2, r3, lsr #1
000671DA 8005db4: 204000a1 subcs r0, r0, r1, lsr #1
000671DA 8005db0: e15000a1 cmp r0, r1, lsr #1
000671DA 8005dac: 21822003 orrcs r2, r2, r3
000671DA 8005da8: 20400001 subcs r0, r0, r1
000671DA 8005da4: e1500001 cmp r0, r1

It looks like SurfaceMerger is a good place for optimization, but I'm not sure what to do about it. Here's the source.

Code:

enum class BreakType {LISTEND, CONCAVE, CONVEX, PORTAL_MID, PORTAL_END};
//Automatically merge adjacent surfaces and return type of break at the end
BreakType SurfaceMerger(Surface& s, const Surface*& sp, const Surface* end, const PortalManager& portals, bool reset)
{
assert(sp && end);
assert(reset || (sp == end) || (s.end() == sp->pos));

if (reset) //If the next surface might be a portal, we want to resuse that code. So create a sort of stub surface and continue as normal
{
assert(sp != end);
s = *sp;
s.length.val = 0;
}

while (sp != end)
{
if (-Rot90(s.dir) == sp->dir) {return BreakType::CONCAVE;}
if (Rot90(s.dir) == sp->dir) {return BreakType::CONVEX;}

assert(s.dir == sp->dir);
if (portals.Linked() && sp->portal())
{
//If next surface is split up by portals, merge only the first portion and return appropriately
if (portals[0].surf == sp && portals[1].surf == sp)
{
if (portals[0].left.s1.val == 0) {s.length += portals[0].left.s2;}
else {assert(portals[1].left.s1.val == 0); s.length += portals[1].left.s2;}
return BreakType::PORTAL_MID;
}
else if (portals[0].surf == sp)
{
s.length += portals[0].left.s2;
return BreakType::PORTAL_END;
}
else if (portals[1].surf == sp)
{
s.length += portals[1].left.s2;
return BreakType::PORTAL_END;
}
}

s.length += sp->length;
++sp;
}

return BreakType::LISTEND;
}

Don't just look at the sorted code, look back at the original code too, especially whenever the execution count changes, so you can see what function it's part of. Just because some labels appear in the sorted toplist doesn't mean they are necessarily smoking guns.

And if you can't optimize something, call it less often. For example, for collision detection, there are ways to reduce the number of collision checks by maintaining separate object lists.

edit: also, what's up with the byte-based memcpy I see there? Which data isn't word aligned?
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

The problem is that I don't know assembly so I can't really tell what's going on.

I can't even read compiler output very well, the labels suck, and there's no comments.
But you don't really need to read ASM, just look for the function, and see which loop seems to be hot. They correspond roughly to the original C++ source.

By the way, what exactly does surface merging do? And can you figure out how to call the function less often?
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

Well when you have adjacent surfaces, you don't want to create a separate contact for each one because it's slower and less accurate.

I suppose one way to avoid that is to maintain a list of surfaces in memory, assuming there's enough space for that.

Also, you can speed up your program by eliminating the explicit Vblank wait if you've already hit vblank. Then you can frame rates that continuously vary up to 60, instead of just 60 or 30.

I think sprite uploading still has to be done during Vblank time though, so use IRQs for that. Use triple buffering for the sprite tables, one is the physical OAM, one "to be copied to OAM", and one "to be written to". Swap the "one to copy" and "one to write to" after every frame is finished calculating. (this is a pointer swap, not a data copy)
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

Variable framerates are a nightmare to work with and rather pointless as well since everything is visible at an integral multiple of the refresh rate anyway.

I guess what I really need is a more efficient way to do collision detection but I'm having trouble coming up with a solution.

By the way, the same code usually runs 4-8 times as fast when moved into IWRAM, vs executing the code from ROM. So see what you can move there.

You can also use Video RAM for fast code execution, but Devkitarm doesn't directly support it.

If you have any lookup tables with 8 or 16 bit values, stick them in VRAM too.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

Last edited by Dwedit on Fri Aug 19, 2011 10:47 pm; edited 1 time in total

How do you tell what's in ROM and what's in IWRAM? Also, how do you tell how much free space in IWRAM you have? It might be especially tricky figuring out the maximum stack size.

GBATEK has a table showing how slow each type of memory is:

Code:

Region Bus Read Write Cycles
BIOS ROM 32 8/16/32 - 1/1/1
Work RAM 32K 32 8/16/32 8/16/32 1/1/1
I/O 32 8/16/32 8/16/32 1/1/1
OAM 32 8/16/32 16/32 1/1/1 *
Work RAM 256K 16 8/16/32 8/16/32 3/3/6 **
Palette RAM 16 8/16/32 16/32 1/1/2 *
VRAM 16 8/16/32 16/32 1/1/2 *
GamePak ROM 16 8/16/32 - 5/5/8 **/***
GamePak Flash 16 8/16/32 16/32 5/5/8 **/***
GamePak SRAM 8 8 8 5 **

"GamePak ROM" costs 8 cycles to execute an ARM instruction, and 5 cycles to execute a THUMB instruction. That's why people usually build things in THUMB mode, since it runs code faster from the slowest kind of memory.
But IWRAM costs 1 cycle to execute an ARM instruction, so that's the best place to put ARM code.

But we've seen that GCC acts stupidly in THUMB mode, and generates very crappy code if you do any long multiplication.

You can tell the DevKitARM magic makefiles which mode you want your code generated for by renaming the source file, so "File.thumb.cpp" makes THUMB code, but "File.arm.cpp" makes ARM code.

All your code is in ROM, unless you put it somewhere else. Use "IWRAM_CODE" to mark a function as one to move to IWRAM. There are also filename features for ".iwram.cpp", but they don't actually work, instead it just does the same as ".arm.cpp"

Global variables go into the BSS section, which is in IWRAM.
Heap variables (allocated with the New operator or malloc) go into EWRAM.
The stack is in IWRAM.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

What happens if you try to put too much code in IWRAM? Is it just a linker error? Or do you get a runtime stack overflow or what?

If you put too much code in RAM, you get link errors. If you put barely too much code that it's still within the 32k limit, but don't have enough space for the stack actually used, your global variables and RAM code eventually gets clobbered by the stack, and it crashes.

To check how much memory your code is using, loiok at the .map file in the build directory.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

I rewrote the searching algorithm and it's much faster now. Before, it did a brute force search against even surface in the level. Now I divide up the level into 128x128 blocks and only search the surfaces in the same block as the shape being tested. The level data now takes up more space due to all the extra data structures involved but its faster at runtime, and I don't think there's any shortage of ROM.

I'd estimate that the collision search phase is now twice as fast as it was before. And the speedup factor will only increase as I larger and more complicated levels.

Another question: It is worth disabling windows when I'm not using them? I'm planning on using windows for the level transitions, but they won't be used during normal gameplay.

As long as you're displaying stuff correctly, it doesn't matter whether windows are enabled or not. In fact, windows are the recommended way to disable background layers on a scanline-by-scanline basis.

Remember that window parameters can be changed each scanline, so your effect doesn't need to be a simple rectangle. One of the simplest scanline-by-scanline window effects is to make a circle instead of a rectangle. But you can make it any shape you want, as long is there's only one hole in the middle. If you need even more possible shapes, you can use the Object Window feature, where sprites become windows.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

Ug, I made a new level with some more features and now the game is too slow again. It looks like the main culprit is the collision detection phase, which is now taking almost half the frame for no apparent reason.

Also, I've done the profiling again, but I still have no clue how to make sense of the data.

By the way, you can do "arm-eabi-objdump -dS file.elf > output.s" to get source code annotations with the disassembly. I just read about this.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

How do you handle your collision detection and how many objects are you collding with?

gbadev.org forum archive

C/C++ > Optimization

#176393 - GLaDOS - Fri Jul 22, 2011 4:44 am

#176397 - elhobbs - Fri Jul 22, 2011 1:14 pm

#176398 - gauauu - Fri Jul 22, 2011 2:50 pm

#176400 - GLaDOS - Fri Jul 22, 2011 3:39 pm

#176401 - ant512 - Fri Jul 22, 2011 3:55 pm

#176402 - GLaDOS - Fri Jul 22, 2011 3:58 pm

#176403 - elhobbs - Fri Jul 22, 2011 6:11 pm

#176404 - GLaDOS - Fri Jul 22, 2011 6:16 pm

#176405 - Miked0801 - Fri Jul 22, 2011 6:20 pm

#176407 - Dwedit - Fri Jul 22, 2011 6:51 pm

#176408 - GLaDOS - Fri Jul 22, 2011 7:07 pm

#176409 - Dwedit - Fri Jul 22, 2011 7:26 pm

#176410 - GLaDOS - Fri Jul 22, 2011 7:30 pm

#176411 - Dwedit - Fri Jul 22, 2011 7:36 pm

#176412 - GLaDOS - Fri Jul 22, 2011 7:40 pm

#176413 - Dwedit - Fri Jul 22, 2011 7:43 pm

#176414 - GLaDOS - Fri Jul 22, 2011 7:51 pm

#176415 - Dwedit - Fri Jul 22, 2011 8:13 pm

#176417 - GLaDOS - Sat Jul 23, 2011 1:36 am

#176418 - Dwedit - Sat Jul 23, 2011 4:59 am

#176419 - GLaDOS - Sat Jul 23, 2011 6:00 am

#176420 - Dwedit - Sat Jul 23, 2011 7:35 am

#176422 - GLaDOS - Sat Jul 23, 2011 5:06 pm

#176423 - Miked0801 - Sat Jul 23, 2011 5:21 pm

#176424 - GLaDOS - Sun Jul 24, 2011 3:38 pm

#176425 - ant512 - Mon Jul 25, 2011 8:59 am

#176427 - Dwedit - Mon Jul 25, 2011 4:30 pm

#176555 - Exophase - Wed Aug 17, 2011 5:39 pm

#176556 - GLaDOS - Wed Aug 17, 2011 5:48 pm

#176563 - GLaDOS - Thu Aug 18, 2011 5:12 am

#176567 - Dwedit - Thu Aug 18, 2011 5:51 pm

#176568 - GLaDOS - Thu Aug 18, 2011 5:52 pm

#176569 - Dwedit - Thu Aug 18, 2011 6:29 pm

#176570 - GLaDOS - Thu Aug 18, 2011 6:57 pm

#176571 - Dwedit - Thu Aug 18, 2011 9:16 pm

#176572 - GLaDOS - Thu Aug 18, 2011 10:09 pm

#176573 - Dwedit - Thu Aug 18, 2011 10:50 pm

#176574 - GLaDOS - Thu Aug 18, 2011 11:00 pm

#176575 - Dwedit - Thu Aug 18, 2011 11:36 pm

#176576 - GLaDOS - Thu Aug 18, 2011 11:41 pm

#176580 - Dwedit - Fri Aug 19, 2011 10:40 pm

#176581 - GLaDOS - Fri Aug 19, 2011 10:41 pm

#176582 - Dwedit - Fri Aug 19, 2011 11:13 pm

#176583 - GLaDOS - Fri Aug 19, 2011 11:15 pm

#176584 - Dwedit - Fri Aug 19, 2011 11:21 pm

#176585 - GLaDOS - Sat Aug 20, 2011 3:46 am

#176592 - GLaDOS - Sun Aug 21, 2011 6:17 pm

#176593 - Dwedit - Sun Aug 21, 2011 9:40 pm

#176612 - GLaDOS - Fri Aug 26, 2011 9:01 pm

#176676 - Dwedit - Tue Sep 13, 2011 10:07 pm

#176677 - Miked0801 - Tue Sep 13, 2011 11:36 pm