gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback machine copies. A new forum can be found here.

C/C++ > Optimization

#176393 - GLaDOS - Fri Jul 22, 2011 4:44 am

I'm writing a GBA game in simple C++ with devkitpro. After a great deal of effort, I've gotten a simple demo of the first room working and playable. The problem is that it is way to slow. As far as I can tell, my update function takes 1-2 frames to update, meaning that it only runs at half speed. So I need to at least double the speed in order to get it to run normally.

So now I'm trying to figure out how to optimize it. Unfortunately, I'm a beginner and have no idea where to start. One possible step I discovered is attempt to put more things in IWRAM. Unfortunately, I know nothing about linker scripts and I'm not sure how to find out what things are currently placed where, let alone how to optimize the section placement.

By the way, no I don't use exceptions, or rtti, or floating point, or virtual functions, or dynamic memory, or the standard library. So there aren't any really obvious speedups like that.

Also, I know that my physics code is the main bottleneck, so I don't need to worry about graphics (and I use DMA for that anyway). When I comment out parts of the physics code, my game runs at full speed.

Also, here are my current compile options
-g -Wall -O2 -mcpu=arm7tdmi -mtune=arm7tdmi -fomit-frame-pointer -ffast-math -mthumb -mthumb-interwork -fno-rtti -fno-exceptions -std=c++0x

#176397 - elhobbs - Fri Jul 22, 2011 1:14 pm

It is not possible to offer advice without more information. code would be best, or at a minimum a description of the algorithm you are using.

#176398 - gauauu - Fri Jul 22, 2011 2:50 pm

What elhobbs said is true, but also consider that it isn't necessarily bad to actually run at 30 fps instead of 60 -- miked says they often do that in commercial games, even.

If you can run at a solid 30, and make everything move twice as fast as what you had before, it's often "good enough" and gives you tons more time for processing each frame.

#176400 - GLaDOS - Fri Jul 22, 2011 3:39 pm

The problem is that I'm not sure what code to post.

Here's my collision detection code. It's probably one of the most performance hogging parts of the game.

Fixed<E>, Vec<E>, and UnitVec are part of the fixed point library I wrote. Fixed is a single fixed point number (32bit), where E is the shift. So Fixed<20> represents a 11.20 number, which is used for all coordinates in my game. Vec is a 2d vector. UnitVec is a special class for axis aligned unit vectors.
The named functions (Add, Mult, Div, etc.) use a 64bit intermediate value for better accuracy. The operators (+, -) just call the corresponding operator on the values directly without worrying about rounding or overflow.

Code:

#pragma once
#include "array.h"

template <class T, class DirType> //Should be shape types
Fixed<20> PenetrationAlongDir(const DirType& N, const Fixed<20> baseN, const Vec<20>& centerOffset, const T& shapeB)
{
    auto minB = Fixed<20>::MAX();

    for(u32 i=0; i<4; ++i)
    {
        auto newB = Dot<20>(N, centerOffset + shapeB.GetPoint(i));
        minB = Min(minB, newB);
    }

    return baseN - minB;
}

template <class T1, class T2, class DirType>
void GetPenetrationSAT(DirType& bestNorm, Fixed<20>& bestPen, const Vec<20>& dist /*centerOffset*/, const T1& shapeA, const T2& shapeB)
{
    bestPen = Fixed<20>::MAX();

    for(u32 i=0; i<4; ++i)
    {
        const auto N = shapeB.GetSidePerp(i);
        const auto baseN = Dot<20>(shapeB.GetPoint(i), N);
        const auto newPen = PenetrationAlongDir(N, baseN, -dist, shapeA);

        if (newPen < bestPen)
        {
            bestPen = newPen;
            bestNorm = N;
            if (newPen.val < 0) {return;}
        }
    }

    for(u32 i=0; i<4; ++i)
    {
        const auto N = shapeA.GetSidePerp(i);
        const auto baseN = Dot<20>(shapeA.GetPoint(i), N);
        const auto newPen = PenetrationAlongDir(N, baseN, dist, shapeB);

        if (newPen < bestPen)
        {
            bestPen = newPen;
            bestNorm = -N;
            if (newPen.val < 0) {return;}
        }
    }
}

//Will return pt2 if it is exactly on the line, but not pt1
template <class DirType>
bool GetLineIntersection(Vec<20>& rval, const DirType& N, const Fixed<20>& baseN, const Vec<20>& pt1, const Vec<20>& pt2)
{
    const auto lastN = Dot<20>(pt1, N);
    const auto nextN = Dot<20>(pt2, N);

    //Point is exactly on side
    if (nextN == baseN)
    {
        rval = pt2;
        return true;
    }
    else if (((nextN>baseN) && (lastN<baseN)) || ((nextN<baseN) && (lastN>baseN)))
    {
        //Clip the two sides together to get intersection point. We use a waited sum of the points on either side
        const auto W2 = Div<30>(baseN - nextN, lastN - nextN);
        const auto W1 = Fixed<30>(1073741824 /*1.0*/) - W2;
        assert(W2.val >= 0 && W1.val >= 0);

        rval = VecMult<20>(pt2, W1) + VecMult<20>(pt1, W2);
        return true;
    }

    return false;
}

template <class T1, class T2> //Should be shape types
bool FindContactPoints(Array<Vec<20>, 2>& Points, const Vec<30>& N, const Vec<20>& centerOffset, const T1& shapeA, const T2& shapeB)
{
    //Figure out what side of a is being intersected. If none match the norm, return false (failure)
    int aside = -1;
    for(u32 i=0; i<4; ++i)
    {
        if (static_cast<decltype(N)>(shapeA.GetSidePerp(i)) == N)
        {
            aside = i; break;
        }
    }

    if (aside < 0) {return false;}

    //Array<Vec<20>, 2> Points;
    u32 pi = 0;

    const auto baseN = Dot<20>(shapeA.GetPoint(aside), N);

    for(u32 i=0; i<4 && pi < 2; ++i)
    {
        //If there's an intersection, it will store in first argument and return true
        if (GetLineIntersection(Points[pi], N, baseN, centerOffset + shapeB.GetPoint(i-1), centerOffset + shapeB.GetPoint(i)))
            {   ++pi;   }
    }

    //If no points were found return failure
    if (pi <= 0) {return false;}

    //Now make sure that both points are actually on the side, i.e. clip them to the corners if not
    const auto baseT1 = Dot<20>(shapeA.GetPoint(aside), Rot90(N));
    const auto baseT2 = Dot<20>(shapeA.GetPoint(aside+1), Rot90(N));
    assert(baseT2 < baseT1);

    for(u32 p=0; p<pi; ++p)
    {
        const auto newT = Dot<20>(Points[p], Rot90(N));

        if (newT >= baseT1) {Points[p] = shapeA.GetPoint(aside);}
        else if (newT <= baseT2) {Points[p] = shapeA.GetPoint(aside+1);}
    }

    //It is possible to only find one point if the shape has a corner just barely touching the other. In this case, just double it for now
    //And the calling function will deal with it
    if (pi == 1) {Points[pi++] = Points[0];}
    assert(pi == 2);
    return true;
}

//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
template <class T2> //Should be shape types
void ShapeCollisionTest(ContactList& contacts, CubeData* bodyA, PosVel* bodyB, Vec<20> offsetB, const T2& shapeB, eMassType massB)
{
    assert(bodyA && bodyB);
    const auto& shapeA = bodyA->shape();
    const auto dist = (bodyB->pos + offsetB) - bodyA->pos;

    Vec<30> bestNorm;
    auto bestPen = Fixed<20>::MAX();
    GetPenetrationSAT(bestNorm, bestPen, dist, shapeA, shapeB);
    if (bestPen.val < 0) {return;}

    Array<Vec<20>, 2> Points; //Contact points
    bool done = false;
    done = FindContactPoints(Points, -bestNorm, dist, shapeA, shapeB);

    if (!done) //Try cutting against the sides of shape B if shape A didn't work
    {
        done = FindContactPoints(Points, bestNorm, -dist, shapeB, shapeA);

        if (!done)
        {
            PRINT("!!!!!!!!!!!!!!!!!!!!!!!!!");
            PRINT(bestNorm);
            PRINT(dist);
            PRINT(shapeA.GetPoint(0));
            PRINT(shapeB.GetPoint(0));
            PRINT(cube.dir);
        }

        assert(done);

        Points[0] += dist; //FCP returns points in the reference frame of the first shape argument.
        Points[1] += dist; //Since we passed b first, we have to move them back to a
    }

    contacts.addCubeContact(bodyA, bodyB, massB, bestNorm, -Points[0], -Points[1], bestPen);
}

template <class T2> //Should be shape types
void ShapeCollisionTest(ContactList& contacts, PickoryData* bodyA, PosVel* bodyB, Vec<20> offsetB, const T2& shapeB, eMassType massB)
{
    assert(bodyA && bodyB);
    const auto& shapeA = bodyA->shape();
    const auto dist = (bodyB->pos + offsetB) - bodyA->pos;

    //Vec<30> bestNorm;
    UnitVec bestNorm;
    auto bestPen = Fixed<20>::MAX();
    GetPenetrationSAT(bestNorm, bestPen, dist, shapeA, shapeB);
    if (bestPen.val < 0) {return;}

    contacts.addContact(bodyA, bodyB, massB, bestNorm, bestPen);
}

template <class T2>
void DoShapeCollisionTests(ContactList& contacts, CubeData* bodyA1, PickoryData* bodyA2, PosVel* bodyB, Vec<20> offsetB, const T2& shapeB, eMassType massB)
{
    ShapeCollisionTest(contacts, bodyA1, bodyB, offsetB, shapeB, massB);
    ShapeCollisionTest(contacts, bodyA2, bodyB, offsetB, shapeB, massB);
}

///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

void LineCollisionTest(ContactList& contacts, CubeData* bodyA, const Surface& s)
{
    PosVel* const bodyB = &dummy;
    assert(bodyA);

    const auto& shapeA = bodyA->shape();
    const auto T = s.dir;
    const auto N = -Rot90(s.dir);
    assert(T == Rot90(N));

    const auto BBsize = Fixed<20>(22985121 /*21.92032*/); // Half width of a bounding square for the cube

    const auto surfN = Dot<20>(s.pos, N);
    const auto cubeN = Dot<20>(bodyA->pos, N);

    //Bounding box check
    if (cubeN >= surfN + BBsize) {return;}
    if (cubeN < surfN) {return;} //If cube is already more than half way in, there's no real hope of resolving the collision, so just bail

    const auto surfT1 = Dot<20>(s.pos, T);
    const auto surfT2 = surfT1 + s.length;
    const auto cubeT = Dot<20>(bodyA->pos, T);

    //Bounding box check
    if (cubeT <= surfT1 - BBsize) {return;}
    if (cubeT >= surfT2 + BBsize) {return;}

    auto bestPen = -Fixed<20>::MAX(); //Best penetration
    Array<Vec<20>, 2> Points; //Contact points
    u32 pi = 0;

    //Clip each side of the cube to get the contact points and calculate penetration while we're at it
    for(u32 i=0; i<4 && pi < 2; ++i)
    {
        const auto nextN = cubeN + Dot<20>(shapeA.GetPoint(i), N);
        bestPen = Max(bestPen, surfN - nextN);

        //If there's an intersection, it will store in first argument and return true
        if (GetLineIntersection(Points[pi], N, surfN, bodyA->pos + shapeA.GetPoint(i-1), bodyA->pos + shapeA.GetPoint(i)))
            {   ++pi;   }
    }

    //Check if penetration is nonnegative
    if(bestPen.val < 0) {return;}

    if (pi == 1) {Points[pi++] = Points[0];}
    assert(pi == 2);

    const auto ptT1 = Dot<20>(Points[0], T);
    const auto ptT2 = Dot<20>(Points[1], T);

    //Check if the collision actually occured within the bounds of the line segment
    if ((ptT1 <= surfT1 && ptT2 <= surfT1) || (ptT1 >= surfT2 && ptT2 >= surfT2)) {return;}

    for(int pi=0; pi<2; ++pi)
    {
        const auto temp = Dot<20>(Points[pi], T);

        if (temp <= surfT1) {Points[pi] = s.pos;}
        else if (temp >= surfT2) {Points[pi] = s.pos + VecMult<20>(s.dir, s.length);}
    }

    contacts.addCubeContact(bodyA, bodyB, mtSTATIC, N, bodyA->pos - Points[0], bodyA->pos - Points[1], bestPen);
}

void LineCollisionTest(ContactList& contacts, PickoryData* bodyA, const Surface& s)
{
    PosVel* const bodyB = &dummy;
    assert(bodyA);

    const auto& shapeA = bodyA->shape();
    const auto T = s.dir;
    const auto N = -Rot90(s.dir);

    //calculate bounding box size along N and T
    const auto boxN = Abs(shapeA.GetPoint(0) * N);
    const auto boxT = Abs(shapeA.GetPoint(0) * T);

    const auto surfN = Dot<20>(s.pos, N);
    const auto kidN = Dot<20>(bodyA->pos, N);

    //Bounding box check
    if (kidN > surfN + boxN) {return;}
    if (kidN < surfN) {return;} //If kid is already more than half way in, there's no real hope of resolving the collision, so just bail

    const auto surfT1 = Dot<20>(s.pos, T);
    const auto surfT2 = surfT1 + s.length;
    const auto kidT = Dot<20>(bodyA->pos, T);

    //Bounding box check along T
    if (kidT <= surfT1 - boxT) {return;}
    if (kidT >= surfT2 + boxT) {return;}

    const auto bestPen = surfN - (kidN - boxN);
    contacts.addContact(bodyA, bodyB, mtSTATIC, N, bestPen);
}

template <class T1> //should be CubeData or PickoryData
bool EdgeChainCollisionTest(ContactList& contacts, T1* bodyA, const Surface& s1, const Surface& s2)
{
    assert(s2.pos == s1.end());
    const auto& shapeA = bodyA->shape();

    //Check if corner is convex
    if (Rot90(s1.dir) == s2.dir)
    {
        //Check if corner is inside cube
        if (shapeA.Contains(s2.pos - bodyA->pos))
        {
            const auto offset = half(s2.end() - s1.pos);
            const auto center = s1.pos + offset;
            const Rect r(offset);
            ShapeCollisionTest(contacts, bodyA, &dummy, center, r, mtSTATIC);
            return true;
        }
    }

    //If we reach this point, we didn't do a corner check
    //LineCollisionTest(contacts, bodyA, s1);

    if (portals.Linked() && s1.portal)
    {
        ArrayList<interval, 4> intvs;

        //Calculate pieces
        for(u32 ptli=0; ptli<2; ++ptli)
        {
            if (portals[ptli].surf == &s1)
            {
                intvs.add(portals[ptli].left);
                intvs.add(portals[ptli].right);
            }
        }

        for(u32 ivi = 0; ivi<intvs.size(); ++ivi)
        {
            const auto& iv = intvs[ivi];
            //if (iv.s2 == iv.s1) {continue;} //skip intervals with 0 length

            const auto offset = VecMult<20>(s1.dir, half(iv.s2 - iv.s1)) + VecMult<20>(Rot90(s1.dir), Fixed<20>(16777216 /*16.0*/));
            const auto center = s1.pos + offset + VecMult<20>(s1.dir, iv.s1);

            const Rect r(offset);
            ShapeCollisionTest(contacts, bodyA, &dummy, center, r, mtSTATIC);
        }

        if (intvs.size() > 0) {return false;}
    }

    LineCollisionTest(contacts, bodyA, s1);
    return false;
}

#176401 - ant512 - Fri Jul 22, 2011 3:55 pm

Yiminy! You know that you don't need to use the "auto" keyword, right? It's implicit.

#176402 - GLaDOS - Fri Jul 22, 2011 3:58 pm

No I'm using the C++0x auto keyword for automatic type deduction.
It makes the code more generic and easier to change.

#176403 - elhobbs - Fri Jul 22, 2011 6:11 pm

I think you are going to need to do a little profiling to see where your time is spent. nothing really exists for the ds/gba platform. there are hardware timers that you can use - although you will have to manually insert time sampling into your code.

#176404 - GLaDOS - Fri Jul 22, 2011 6:16 pm

Do you have any guides to timing? I haven't used timers before. (Or interrupts for that matter)


Also, how do you make sure that all the code and data is placed in the right section?

#176405 - Miked0801 - Fri Jul 22, 2011 6:20 pm

Do you have access to any sort of profiler? Without that, you are indeed guessing on where the slowdown is. Perhaps even an intrusive, timer based callback that stores and counts stack locations (a profiler) you could write yourself.

Looking at the code you uploaded:
Everytime I see newT, I jump and think you are allocating memory. :)
pi as a local also threw me off.
What exactly does Array<> do? Is this allocating heap or stack memory and how heavy is it? In general, dynamic memory allocations tend to get pretty heavy on performance if done in tight loops.

I'd also venture to say that this is a pretty heavy duty collision detection routine for a GBA, but if you aren't doing too much else, it's probably fine.

Overall, this is very well constructed code. Easy to follow and well asserted. :)

I see no immediately obvious things that are "slow" here outside of the possibility of dynamic memory allocations for the Array and ArrayList stuff. You minimize divides, don't sqrt, don't float, and do your work quickly. Have you considered moving some of this code to ITCM and/or making it ARM/Thumb to see if it helps? I'd also like to confirm your stack is it DTCM. Finally, do you have other code running like sound/music?

#176407 - Dwedit - Fri Jul 22, 2011 6:51 pm

I made a modified version of VisualBoyAdavnce that does simple profiling (counts the number of times each memory address was executed). For ARM instructions only.

I use it with some other programs I made that put it against a disassembly.

I usually found that inner loops account for by far the most time of any program.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#176408 - GLaDOS - Fri Jul 22, 2011 7:07 pm

I don't have any sound or music code. I figured that is something I can worry about after I get the rest of the game working.

Here's my array classes. I don't do any dynamic memory allocations.

The ArrayList is basically just a plain array that also keeps track of how many elements have been stored. I had troubling deciding what to name it but I eventually decided ArrayList would be the least confusing name.

Code:
#pragma once
#include "ints.h"
#include "macros.h"

template <typename T, u32 N>
class Array
{
    T arr[N];

    public:

    T& operator[](u32 i) {assert(i < size()); return arr[i];}
    const T& operator[](u32 i) const {assert(i < size()); return arr[i];}
    static u32 size() {return N;}
};

template <typename T, u32 N>
class ArrayList
{
    T arr[N];
    u32 m_size;

    public:
    ArrayList() : m_size(0) {}

    T& operator[](u32 i) {assert(i < size()); return arr[i];}
    const T& operator[](u32 i) const {assert(i < size()); return arr[i];}

    u32 size() const {return m_size;}
    u32 capacity() const {return N;}

    void clear() {m_size = 0;}

    void add(const T& val)
    {
        assert(m_size < N);
        if (m_size < N) {arr[m_size++] = val;}
    }

    typedef T* iterator;
    typedef const T* const_iterator;
    const T* begin() const {return arr;}
    const T* end() const {return arr+size();}
    T* begin()  {return arr;}
    T* end()  {return arr+size();}

    //These functions are useful for adding elements in place
    //T& next() {assert(m_size < N); return arr[m_size];}
    //void sizeinc() {m_size++;}
};

template <typename T>
struct ArrayConstPtr
{
    const T* arr;
    u32 N;

    //T& operator[](u32 i) {assert(i < size()); return arr[i];}
    const T& operator[](u32 i) const {assert(i < size()); return arr[i];}
    u32 size() const {return N;}

    typedef const T* iterator;
    typedef const T* const_iterator;
    const T* begin() const {return arr;}
    const T* end() const {return arr+size();}
};

template <typename T> auto begin(T& container) -> decltype(container.begin()) {return container.begin();}
template <typename T> auto end(T& container) -> decltype(container.end()) {return container.end();}
template <typename T> auto begin(const T& container) -> decltype(container.begin()) {return container.begin();}
template <typename T> auto end(const T& container) -> decltype(container.end()) {return container.end();}

#176409 - Dwedit - Fri Jul 22, 2011 7:26 pm

You do have optimizations turned on, right?
You could try defining NDEBUG to remove all asserts from the generated code.

Otherwise, find out what's executed the most (with a profiler), build it as ARM code and move it to IWRAM.

Also, look at the disassembly of your code to see how good it is.

If you send me the .ELF file, I'll try to profile it.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#176410 - GLaDOS - Fri Jul 22, 2011 7:30 pm

Yes, I have O2 on. I already posted my compiler flags in the first post.
And the assert is already just defined as nothing. I have no idea how to code a useful assert statement for the GBA anyway.

As far as looking at the assembly, I don't know how to do that. Do you have any guides? Also, how do I send you the elf?

Code:
#pragma once

#define STATICASSERT(x) static_assert((x), #x)

#ifdef TESTING
    #include <cassert>
    #define PRINT(x) std::cout << #x ": " << (x) << '\n'
#else
    #define PRINT(x)
    #define assert(x)
#endif

#176411 - Dwedit - Fri Jul 22, 2011 7:36 pm

Easy way to do disassembly:
arm-eabi-objdump -d filename.elf > outputfile.s

If you are using the example makefiles, you get an ELF file before it makes a GBA file. The ELF file is basically the same as the GBA file since the code is already relocated to its final addresses, but it still has the symbols intact.

Right now, my profiling tools are very limited, you need to build it as ARM mode instead of thumb mode, but they still work okay.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#176412 - GLaDOS - Fri Jul 22, 2011 7:40 pm

I have the elf file, I was just wondering how to send it to you.

Edit: I figured out that dekitpro was making a multiboot image by default. I tried changing it, but there is no visible effect on speed.


Last edited by GLaDOS on Fri Jul 22, 2011 8:08 pm; edited 2 times in total

#176413 - Dwedit - Fri Jul 22, 2011 7:43 pm

Just post it on my board somewhere, I allow guests to upload files.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#176414 - GLaDOS - Fri Jul 22, 2011 7:51 pm

I tried to post it but the attachment won't show up.

#176415 - Dwedit - Fri Jul 22, 2011 8:13 pm

Sorry, forgot to turn off the 100k size limit... I increased it to 10MB.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#176417 - GLaDOS - Sat Jul 23, 2011 1:36 am

I added a rudimentary by recording the scanline counter at certain points and drawing a sprite at that location. It's crude but at least it's better than nothing.

I found some divisions in my contact solving code that seem unnecessary. I tried to replace the 3 divisions with 1 division followed by multiplication by the reciprocal. There was a large speedup, but unfortunately, the physics became very unstable.
Presumably there were large rounding errors introduced by taking the reciprocal, though I'm not sure why. Even in the worse case scenario, it should have a relative error of at most 1/139665.

Code:
            //Construct inverse matrix
            const auto cc = Mult<20>(c, c);         // max 1922, min .25
            const auto det = Mult<19>(cc, invM);    // 1922*2 to .25
            const auto iim = Mult<24>(I, invM);     // 40 to 80

            //Multiplying by reciprocal is faster than division
            /*const auto invDet = Div<28>( Fixed<30>(1073741824), det );

            const auto AInv11 = Mult<19>( Add<21>(iim, bb), invDet);
            const auto AInv12 = Mult<19>( Add<21>(iim, ab), invDet);
            const auto AInv21 = AInv12;
            const auto AInv22 = Mult<19>( Add<21>(iim, aa), invDet);*/

            const auto AInv11 = Div<19>( Add<21>(iim, bb), det);   // max 2242
            const auto AInv12 = -Div<19>( Add<21>(iim, ab), det);
            const auto AInv21 = AInv12;
            const auto AInv22 = Div<19>( Add<21>(iim, aa), det);

#176418 - Dwedit - Sat Jul 23, 2011 4:59 am

I can see that you are using the "Noob method" of waiting for vblank, one that repeatedly checks VCOUNT inside a while loop. It sticks out like a sore thumb in the profile. You should replace it with a call to VBlankIntrWait(), which halts the processor until an interrupt happens, and uses less power.

Looks like it spends lots of time in __aeabi_lmul and __aeabi_uidiv, which are slow ways of doing multiplication and division.

Normally multiplying two 32 bit numbers into a 64 bit number is a single ARM instruction that takes 7 cycles to complete, not a call to a 36 instruction long THUMB multiply function. If you can, find a way to eliminate 64 bit x 64 bit multiplication, and get it down to 32 bit x 32 bit.
Also, if you are dividing by the same number a lot, find out its reciprocal, then multiply by it instead. So you take 0x100000000 (a 64 bit number), and do the division using some slower division code. You get the reciprocal. You can then use 32x32 to 64 bit multiplication to divide, just by discarding the low 32 bits of your result.

I suggest you try to avoid using THUMB mode, since it just makes horrible choices of how to do multiplication and division.

Also, the program fails to run on NO$GBA, but runs fine on VisualBoyAdvance.

If you still can't get the compiler to generate good code for multiplying, you can use inline assembly to force it to use the MULL/UMULL instruction.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#176419 - GLaDOS - Sat Jul 23, 2011 6:00 am

Wait, so if I have a line like
s64 p = (s64)a.val * (s64)b.val;
should I just get rid of the casts?

Also, I already got rid of several divisions since I posted that elf. The remaining ones are either impossible to get rid of or cause bugs when removed (I'm looking into that at the moment.)

And sorry about the noob thing. I'm not very experienced with GBA coding. I've never even used an interrupt before.
By the way, do you know why it doesn't run on No$GBA? I use VBa so I didn't notice.


EDIT: I tried getting rid of the s64 casts in the multiplication function, and everything completely broke, presumably due to overflow. If I get rid of only one of the casts, then it works like normal but there is also no speed increase.

#176420 - Dwedit - Sat Jul 23, 2011 7:35 am

Build in ARM mode, not THUMB mode.

This little piece of code here:
Code:

s32 test(s32 a, s32 b)
{
   s64 value = (s64)a * (s64)b;
   return value >> 32;
}

In ARM mode, it builds to this:

Code:

test:
   str   r4, [sp, #-4]!
   smull   r3, r4, r1, r0
   mov   r0, r4
   ldmfd   sp!, {r4}
   bx   lr


Not too bad, but the compiler could have done this better, eliminate the mov, and eliminate the stack push/pop.

In THUMB mode, it builds to this:
Code:

test:
   push   {r3, lr}
   mov   r2, r0
   asr   r3, r2, #31
   mov   r0, r1
   asr   r1, r1, #31
   bl   __aeabi_lmul
   @ sp needed for prologue
   mov   r0, r1
   pop   {r3}
   pop   {r1}
   bx   r1


So you want to avoid THUMB mode for anything that does any multiplication, otherwise it calls some slow multiply function to actually do the work, instead of simply using the SMULL instruction.

Using reciprocals for signed division is a tad problematic, you should work with absolute values, and set the sign at the end. Otherwise you could end up with problems such as negative numbers off by 1.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#176422 - GLaDOS - Sat Jul 23, 2011 5:06 pm

I changed the options from -mthumb -mthumb-interwork to -marm and it's now much faster. Thanks!

It now lags only in complicated situations.

#176423 - Miked0801 - Sat Jul 23, 2011 5:21 pm

Ah the beuaty of profiling, showing exactly where one needs to improve things :)

#176424 - GLaDOS - Sun Jul 24, 2011 3:38 pm

I figured out why the game broke when I change the division with multiplication by reciprocal.

It turns out to be a really stupid mistake. I accidentally got rid of a minus sign while changing the divisions to multiplications.

#176425 - ant512 - Mon Jul 25, 2011 8:59 am

GLaDOS wrote:
No I'm using the C++0x auto keyword for automatic type deduction.
It makes the code more generic and easier to change.


Oh yes! They've re-purposed "auto" to work like C#'s "var".

#176427 - Dwedit - Mon Jul 25, 2011 4:30 pm

In fact, you can just "#define var auto" if you really like the var keyword.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#176555 - Exophase - Wed Aug 17, 2011 5:39 pm

Also note that VBA tends to have worse timings than the real thing. If you run the game on hardware you might find that it's faster.

#176556 - GLaDOS - Wed Aug 17, 2011 5:48 pm

Most people will probably be playing it on an emulator though since I don't exactly have the resources to manufacture cartridges.

#176563 - GLaDOS - Thu Aug 18, 2011 5:12 am

Update: Once again, my game is running too slow. I think one possible target for optimization is my portal spark code.
It is intended to create an animated spark effect coming out of portals when they are linked. Each spark has a random start position, velocity, and lifetime.

Code:

class Xorshift32
{
    u32 val;

    public:
    Xorshift32(u32 seed=0x92D68CA2): val(seed) {}

    u32 GetNext()
    {
        val ^=(val<<13);
        val ^=(val>>17);
        val ^=(val<<5);

        assert(val != 0);
        return val;
    }

    //GetNext always returns nonzero. Shift so it is never intmin
    s32 GetSigned() {   return (GetNext() + 0x80000000u);  }
};

struct SparkData{
    Vec<20> pos;
    Vec<20> vel;
    u32 time;
};

Array<ArrayList<SparkData, 16>, 2> sparkdata;
Xorshift32 prng;

void updateSparks(u32 pi)
{
    auto& data = sparkdata[pi];

    for(u32 i = 0; i < data.size();)
    {
        if (data[i].time)
        {
            data[i].pos += data[i].vel;
            --data[i].time;
            ++i;
        }
        else {data.swapremove(i);} //Don't increment i if we're removing an element or else we'll skip one
    }

    if (portals.Linked() && data.size() < data.capacity())
    {
        //CreateSpark(myprng.GetRange(-HALFWIDTH(),HALFWIDTH()), -2, myprng.GetRange(.7,1.3), 5 + myprng.GetNext() % 23);
        const auto px = prng.GetSigned() >> 6; // -32.0 to 32.0
        const auto vy = 393216 /*0.375*/ + (s32)(prng.GetNext() >> 14); //0.375 to 0.625

        //generate number from 0 to 44 without using modulo. 0 - 18 are slightly more likely
        //empirically, it doesn't appear to make any difference in speed though, probably because it only runs once or twice per tick
        u32 t = prng.GetNext() >> 26;
        if (t >= 45) {t -= 45;}

        SparkData s;
        s.pos = portals[pi].ToWorld( Vec<20>(px, 0) );
        s.vel = portals[pi].Transform( Vec<20>(0, vy) );
        s.time = 10u + t;
        data.add(s);
    }
}

#176567 - Dwedit - Thu Aug 18, 2011 5:51 pm

Break out the profiler again and see what still needs optimization?
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#176568 - GLaDOS - Thu Aug 18, 2011 5:52 pm

How do I profile it?

#176569 - Dwedit - Thu Aug 18, 2011 6:29 pm

Tools needed:
DevKitArm (for arm-eabi-objdump)
VBA_TEH_PROFILER.EXE
profiler_merge.exe
TextPad Text Editor (because it has a Sort feature)

Start with your ELF file. Disassemble it to ASM:

arm-eabi-objdump -D portal.elf > portal.s

Run your GBA file in a special version of VBA that I built, called "VBA_TEH_PROFILER". Run it for a short time, 2 minutes is plenty, making sure to cover all the code you're interested in checking. Then save the profile log using "File > Save Profile Data".

Run "profiler_merge" to merge the instruction execution count with the disassembly.

profiler_merge portal.s portal.profile portal_output.s

Now you have some huge file that looks like this:
Code:

00001E30 08000304 <_Z17NeedToChangeBlocki>:
00001E30  8000304:   b570         push   {r4, r5, r6, lr}
00001E30  8000306:   4b14         ldr   r3, [pc, #80]   ; (8000358 <_Z17NeedToChangeBlocki+0x54>)
00001E30  8000308:   681a         ldr   r2, [r3, #0]
00001E30  800030a:   2001         movs   r0, #1
00001E30  800030c:   2a00         cmp   r2, #0

... snipped ...

00001E26  8000348:   428d         cmp   r5, r1
00001E26  800034a:   db00         blt.n   800034e <_Z17NeedToChangeBlocki+0x4a>
00001E26  800034c:   2000         movs   r0, #0
00001E26  800034e:   0600         lsls   r0, r0, #24
00001E2B  8000350:   0e00         lsrs   r0, r0, #24
00001E2B  8000352:   bc70         pop   {r4, r5, r6}
00001E30  8000354:   bc02         pop   {r1}
00001E30  8000356:   4708         bx   r1

Left column is the number of times that instruction was executed. (it's approximate for THUMB instructions though, it counts both halves of the word as the same address, but it's accurate for ARM instructions)
Middle column is the address in ROM of the code.
Rest of the line is the disassembly.

Now get TextPad, a great text editor. It can sort the file. If you sort the text by column 1 in descending order, it will put all the most executed code at the top of the file. Of course, everything will be in semi-random order after a sort, so you can search the original file for the address if you want to see the code near there.

You usually see that inner loops are much hotter than their surrounding code, and they tend to need optimization the most.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#176570 - GLaDOS - Thu Aug 18, 2011 6:57 pm

Here's the first few lines of the results. I'm guessing that the first 3 are the busy waiting part of the game loop (haven't tried interrupts yet). But that won't affect speed for obvious reasons.

Code:
03072536  80071bc:   9afffffc    bls   80071b4 <main+0xe4>
03072536  80071b8:   e353009f    cmp   r3, #159   ; 0x9f
03072536  80071b4:   e1d430b6    ldrh   r3, [r4, #6]
00D020CE  80071b0:   8afffffc    bhi   80071a8 <main+0xd8>
00D020CE  80071ac:   e353009f    cmp   r3, #159   ; 0x9f
00D020CE  80071a8:   e1d430b6    ldrh   r3, [r4, #6]
000E3494  800048c:   1affffe4    bne   8000424 <_Z13SurfaceMergerR7SurfaceRPKS_S2_RK13PortalManagerb+0x60>
000E3494  8000488:   e1520006    cmp   r2, r6
000E3494  8000484:   e5932008    ldr   r2, [r3, #8]
000A426B  8000428:   0a00001e    beq   80004a8 <_Z13SurfaceMergerR7SurfaceRPKS_S2_RK13PortalManagerb+0xe4>
000A426B  8000424:   e1520005    cmp   r2, r5
0008DB54  8006008:   1afffffb    bne   8005ffc <memcpy+0xc0>
0008DB54  8006004:   e4c3c001    strb   ip, [r3], #1
0008DB54  8006000:   e2522001    subs   r2, r2, #1
0008DB54  8005ffc:   e4d1c001    ldrb   ip, [r1], #1
0007A0F9  8000434:   0a000009    beq   8000460 <_Z13SurfaceMergerR7SurfaceRPKS_S2_RK13PortalManagerb+0x9c>
0007A0F9  8000430:   e3510000    cmp   r1, #0
0007A0F9  800042c:   e5971000    ldr   r1, [r7]
00078766  8000480:   0a000015    beq   80004dc <_Z13SurfaceMergerR7SurfaceRPKS_S2_RK13PortalManagerb+0x118>
00078766  800047c:   e58a3000    str   r3, [sl]
00078766  8000478:   e1a04003    mov   r4, r3
00078766  8000474:   e58c2010    str   r2, [ip, #16]
00078766  8000470:   e1580003    cmp   r8, r3
00078766  800046c:   e0812002    add   r2, r1, r2
00078766  8000468:   e2843018    add   r3, r4, #24
00078766  8000464:   e59c1010    ldr   r1, [ip, #16]
00078766  8000460:   e5932010    ldr   r2, [r3, #16]
0006F085 080003c4 <_Z13SurfaceMergerR7SurfaceRPKS_S2_RK13PortalManagerb>:
0006F085  80004a4:   e12fff1e    bx   lr
0006F085  80004a0:   e8bd0ff0    pop   {r4, r5, r6, r7, r8, r9, sl, fp}
0006F085  8000420:   ea000017    b   8000484 <_Z13SurfaceMergerR7SurfaceRPKS_S2_RK13PortalManagerb+0xc0>
0006F085  800041c:   e269b000    rsb   fp, r9, #0
0006F085  8000418:   e2656000    rsb   r6, r5, #0
0006F085  8000414:   e59c9008    ldr   r9, [ip, #8]
0006F085  8000410:   e59c500c    ldr   r5, [ip, #12]
0006F085  800040c:   0a000032    beq   80004dc <_Z13SurfaceMergerR7SurfaceRPKS_S2_RK13PortalManagerb+0x118>
0006F085  8000408:   e1a03004    mov   r3, r4
0006F085  8000404:   e1580004    cmp   r8, r4
0006F085  8000400:   158c3010    strne   r3, [ip, #16]
0006F085  80003fc:   18850003    stmne   r5, {r0, r1}
0006F085  80003f8:   13a03000    movne   r3, #0
0006F085  80003f4:   18960003    ldmne   r6, {r0, r1}
0006F085  80003f0:   18a5000f    stmiane   r5!, {r0, r1, r2, r3}
0006F085  80003ec:   11a0500c    movne   r5, ip
0006F085  80003e8:   18b6000f    ldmne   r6!, {r0, r1, r2, r3}
0006F085  80003e4:   e1a08002    mov   r8, r2
0006F085  80003e0:   e1a0a001    mov   sl, r1
0006F085  80003dc:   e1a07003    mov   r7, r3
0006F085  80003d8:   e1a0c000    mov   ip, r0
0006F085  80003d4:   11a06004    movne   r6, r4
0006F085  80003d0:   e35c0000    cmp   ip, #0
0006F085  80003cc:   e5914000    ldr   r4, [r1]
0006F085  80003c8:   e5ddc020    ldrb   ip, [sp, #32]
0006F085  80003c4:   e92d0ff0    push   {r4, r5, r6, r7, r8, r9, sl, fp}
000671DA  8005de0:   1affffef    bne   8005da4 <__aeabi_uidiv+0x58>
000671DA  8005ddc:   11a01221    lsrne   r1, r1, #4
000671DA  8005dd8:   11b03223    lsrsne   r3, r3, #4
000671DA  8005dd4:   e3500000    cmp   r0, #0
000671DA  8005dd0:   218221a3    orrcs   r2, r2, r3, lsr #3
000671DA  8005dcc:   204001a1    subcs   r0, r0, r1, lsr #3
000671DA  8005dc8:   e15001a1    cmp   r0, r1, lsr #3
000671DA  8005dc4:   21822123    orrcs   r2, r2, r3, lsr #2
000671DA  8005dc0:   20400121    subcs   r0, r0, r1, lsr #2
000671DA  8005dbc:   e1500121    cmp   r0, r1, lsr #2
000671DA  8005db8:   218220a3    orrcs   r2, r2, r3, lsr #1
000671DA  8005db4:   204000a1    subcs   r0, r0, r1, lsr #1
000671DA  8005db0:   e15000a1    cmp   r0, r1, lsr #1
000671DA  8005dac:   21822003    orrcs   r2, r2, r3
000671DA  8005da8:   20400001    subcs   r0, r0, r1
000671DA  8005da4:   e1500001    cmp   r0, r1


It looks like SurfaceMerger is a good place for optimization, but I'm not sure what to do about it. Here's the source.

Code:
enum class BreakType {LISTEND, CONCAVE, CONVEX, PORTAL_MID, PORTAL_END};
//Automatically merge adjacent surfaces and return type of break at the end
BreakType SurfaceMerger(Surface& s, const Surface*& sp, const Surface* end, const PortalManager& portals, bool reset)
{
    assert(sp && end);
    assert(reset || (sp == end) || (s.end() == sp->pos));

    if (reset) //If the next surface might be a portal, we want to resuse that code. So create a sort of stub surface and continue as normal
    {
        assert(sp != end);
        s = *sp;
        s.length.val = 0;
    }

    while (sp != end)
    {
        if (-Rot90(s.dir) == sp->dir) {return BreakType::CONCAVE;}
        if (Rot90(s.dir) == sp->dir) {return BreakType::CONVEX;}

        assert(s.dir == sp->dir);
        if (portals.Linked() && sp->portal())
        {
            //If next surface is split up by portals, merge only the first portion and return appropriately
            if (portals[0].surf == sp && portals[1].surf == sp)
            {
                if (portals[0].left.s1.val == 0) {s.length += portals[0].left.s2;}
                else {assert(portals[1].left.s1.val == 0); s.length += portals[1].left.s2;}
                return BreakType::PORTAL_MID;
            }
            else if (portals[0].surf == sp)
            {
                s.length += portals[0].left.s2;
                return BreakType::PORTAL_END;
            }
            else if (portals[1].surf == sp)
            {
                s.length += portals[1].left.s2;
                return BreakType::PORTAL_END;
            }
        }

        s.length += sp->length;
        ++sp;
    }

    return BreakType::LISTEND;
}

#176571 - Dwedit - Thu Aug 18, 2011 9:16 pm

Don't just look at the sorted code, look back at the original code too, especially whenever the execution count changes, so you can see what function it's part of. Just because some labels appear in the sorted toplist doesn't mean they are necessarily smoking guns.

And if you can't optimize something, call it less often. For example, for collision detection, there are ways to reduce the number of collision checks by maintaining separate object lists.

edit: also, what's up with the byte-based memcpy I see there? Which data isn't word aligned?
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#176572 - GLaDOS - Thu Aug 18, 2011 10:09 pm

The problem is that I don't know assembly so I can't really tell what's going on.

#176573 - Dwedit - Thu Aug 18, 2011 10:50 pm

I can't even read compiler output very well, the labels suck, and there's no comments.
But you don't really need to read ASM, just look for the function, and see which loop seems to be hot. They correspond roughly to the original C++ source.

By the way, what exactly does surface merging do? And can you figure out how to call the function less often?
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#176574 - GLaDOS - Thu Aug 18, 2011 11:00 pm

Well when you have adjacent surfaces, you don't want to create a separate contact for each one because it's slower and less accurate.

I suppose one way to avoid that is to maintain a list of surfaces in memory, assuming there's enough space for that.

#176575 - Dwedit - Thu Aug 18, 2011 11:36 pm

Also, you can speed up your program by eliminating the explicit Vblank wait if you've already hit vblank. Then you can frame rates that continuously vary up to 60, instead of just 60 or 30.

I think sprite uploading still has to be done during Vblank time though, so use IRQs for that. Use triple buffering for the sprite tables, one is the physical OAM, one "to be copied to OAM", and one "to be written to". Swap the "one to copy" and "one to write to" after every frame is finished calculating. (this is a pointer swap, not a data copy)
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#176576 - GLaDOS - Thu Aug 18, 2011 11:41 pm

Variable framerates are a nightmare to work with and rather pointless as well since everything is visible at an integral multiple of the refresh rate anyway.


I guess what I really need is a more efficient way to do collision detection but I'm having trouble coming up with a solution.

#176580 - Dwedit - Fri Aug 19, 2011 10:40 pm

By the way, the same code usually runs 4-8 times as fast when moved into IWRAM, vs executing the code from ROM. So see what you can move there.

You can also use Video RAM for fast code execution, but Devkitarm doesn't directly support it.

If you have any lookup tables with 8 or 16 bit values, stick them in VRAM too.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."


Last edited by Dwedit on Fri Aug 19, 2011 10:47 pm; edited 1 time in total

#176581 - GLaDOS - Fri Aug 19, 2011 10:41 pm

How do you tell what's in ROM and what's in IWRAM? Also, how do you tell how much free space in IWRAM you have? It might be especially tricky figuring out the maximum stack size.

#176582 - Dwedit - Fri Aug 19, 2011 11:13 pm

GBATEK has a table showing how slow each type of memory is:
Code:

  Region        Bus   Read      Write     Cycles
  BIOS ROM      32    8/16/32   -         1/1/1
  Work RAM 32K  32    8/16/32   8/16/32   1/1/1
  I/O           32    8/16/32   8/16/32   1/1/1
  OAM           32    8/16/32   16/32     1/1/1 *
  Work RAM 256K 16    8/16/32   8/16/32   3/3/6 **
  Palette RAM   16    8/16/32   16/32     1/1/2 *
  VRAM          16    8/16/32   16/32     1/1/2 *
  GamePak ROM   16    8/16/32   -         5/5/8 **/***
  GamePak Flash 16    8/16/32   16/32     5/5/8 **/***
  GamePak SRAM  8     8         8         5     **


"GamePak ROM" costs 8 cycles to execute an ARM instruction, and 5 cycles to execute a THUMB instruction. That's why people usually build things in THUMB mode, since it runs code faster from the slowest kind of memory.
But IWRAM costs 1 cycle to execute an ARM instruction, so that's the best place to put ARM code.

But we've seen that GCC acts stupidly in THUMB mode, and generates very crappy code if you do any long multiplication.

You can tell the DevKitARM magic makefiles which mode you want your code generated for by renaming the source file, so "File.thumb.cpp" makes THUMB code, but "File.arm.cpp" makes ARM code.

All your code is in ROM, unless you put it somewhere else. Use "IWRAM_CODE" to mark a function as one to move to IWRAM. There are also filename features for ".iwram.cpp", but they don't actually work, instead it just does the same as ".arm.cpp"

Global variables go into the BSS section, which is in IWRAM.
Heap variables (allocated with the New operator or malloc) go into EWRAM.
The stack is in IWRAM.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#176583 - GLaDOS - Fri Aug 19, 2011 11:15 pm

What happens if you try to put too much code in IWRAM? Is it just a linker error? Or do you get a runtime stack overflow or what?

#176584 - Dwedit - Fri Aug 19, 2011 11:21 pm

If you put too much code in RAM, you get link errors. If you put barely too much code that it's still within the 32k limit, but don't have enough space for the stack actually used, your global variables and RAM code eventually gets clobbered by the stack, and it crashes.

To check how much memory your code is using, loiok at the .map file in the build directory.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#176585 - GLaDOS - Sat Aug 20, 2011 3:46 am

I rewrote the searching algorithm and it's much faster now. Before, it did a brute force search against even surface in the level. Now I divide up the level into 128x128 blocks and only search the surfaces in the same block as the shape being tested. The level data now takes up more space due to all the extra data structures involved but its faster at runtime, and I don't think there's any shortage of ROM.

I'd estimate that the collision search phase is now twice as fast as it was before. And the speedup factor will only increase as I larger and more complicated levels.

#176592 - GLaDOS - Sun Aug 21, 2011 6:17 pm

Another question: It is worth disabling windows when I'm not using them? I'm planning on using windows for the level transitions, but they won't be used during normal gameplay.

#176593 - Dwedit - Sun Aug 21, 2011 9:40 pm

As long as you're displaying stuff correctly, it doesn't matter whether windows are enabled or not. In fact, windows are the recommended way to disable background layers on a scanline-by-scanline basis.

Remember that window parameters can be changed each scanline, so your effect doesn't need to be a simple rectangle. One of the simplest scanline-by-scanline window effects is to make a circle instead of a rectangle. But you can make it any shape you want, as long is there's only one hole in the middle. If you need even more possible shapes, you can use the Object Window feature, where sprites become windows.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#176612 - GLaDOS - Fri Aug 26, 2011 9:01 pm

Ug, I made a new level with some more features and now the game is too slow again. It looks like the main culprit is the collision detection phase, which is now taking almost half the frame for no apparent reason.


Also, I've done the profiling again, but I still have no clue how to make sense of the data.

#176676 - Dwedit - Tue Sep 13, 2011 10:07 pm

By the way, you can do "arm-eabi-objdump -dS file.elf > output.s" to get source code annotations with the disassembly. I just read about this.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#176677 - Miked0801 - Tue Sep 13, 2011 11:36 pm

How do you handle your collision detection and how many objects are you collding with?