gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback machine copies. A new forum can be found here.

ASM > Triangle Sort in ASM how to improve ???

#18815 - batblaster - Tue Apr 06, 2004 12:48 am

Hi !

This is my triangle sort routine in ASM , i've maded all routine in asm like rotation and fill but i think the sort is not very fast, i'm very happy if someone can help me on make it faster...

Code:


@-----------------------------------------------------------------------
@
@   Triangle sort , this routine make the sort of triangle faces
@   to be filled
@
@extern void TriangleSort(PTriangolo,void*,u32 ntriangoli);
@
@PTriangolo is a structure
@-----------------------------------------------------------------------

.global TriangleSort
.arm
.align 2
.section .iwram @, "ax", %progbits

@    for (i=0;i<ntriangoli;i++)
@    {
@        z[i]=(((PPunto3D) &Punti_rotati[(&triang[i])->p1])->z+
@         ((PPunto3D) &Punti_rotati[(&triang[i])->p2])->z+
@         ((PPunto3D) &Punti_rotati[(&triang[i])->p3])->z);
@    }

TriangleSort:
    stmfd   sp!, {r4-r11, lr}

@Start of the routine who create the Z array, the same of the "C" version showed above

    mov     r5,#0
    mov     r10,#8
    mov     r11,#12
    mov     r14,r1
srtloop:
    mov     r12,r5,lsl #4
    ldr     r6,[r0,r12]
    add     r12,r12,#4
    ldr     r7,[r0,r12]
    add     r12,r12,#4
    ldr     r8,[r0,r12]
    mla     r9,r6,r11,r10
    mla     r6,r7,r11,r10
    mla     r7,r8,r11,r10
    ldr     r9,[r3,r9]
    ldr     r6,[r3,r6]
    ldr     r7,[r3,r7]
    add     r8,r9,r6
    add     r8,r8,r7
    str     r8,[r14],#4
    add     r5,r5,#1
    cmp     r5,r2
    bmi     srtloop
@End of routine who make Z Array

    mov     r4,#1           @i=1
sortloop:
    mov     r5,r2,lsl #2           @numtrinagoli in r5
sortloop2:
    sub     r5,r5,#4        @numtriangoli will be -1 this is j
    cmp     r5,r4           @compare j with i
    bmi     out             @if is < i will go out else >= go on

    ldr     r6,[r1,r5]     
    sub     r5,r5,#4        @j-1
    ldr     r7,[r1,r5]     
    cmp     r6,r7           @if z[j]>z[j-1]
    strgt   r6,[r1,r5]     
    addgt   r5,r5,#4     
    strgt   r7,[r1,r5]
    movgt   r5,r5,lsl #2
    ldrgt   r6,[r0,r5]
    addgt   r5,r5,#4
    ldrgt   r7,[r0,r5]
    addgt   r5,r5,#4
    ldrgt   r8,[r0,r5]
    addgt   r5,r5,#4
    ldrgt   r9,[r0,r5]
    subgt   r5,r5,#12+16
    ldrgt   r10,[r0,r5]
    addgt   r5,r5,#4
    ldrgt   r11,[r0,r5]
    addgt   r5,r5,#4
    ldrgt   r12,[r0,r5]
    addgt   r5,r5,#4
    ldrgt   r14,[r0,r5]

    strgt   r9,[r0,r5]
    subgt   r5,r5,#4
    strgt   r8,[r0,r5]
    subgt   r5,r5,#4
    strgt   r7,[r0,r5]
    subgt   r5,r5,#4
    strgt   r6,[r0,r5]
    addgt   r5,r5,#16
    strgt   r10,[r0,r5]
    addgt   r5,r5,#4
    strgt   r11,[r0,r5]
    addgt   r5,r5,#4
    strgt   r12,[r0,r5]
    addgt   r5,r5,#4
    strgt   r14,[r0,r5]
    subgt   r5,r5,#12
    movgt   r5,r5,asr #2
    b       sortloop2
out:
    add     r4,r4,#1
    cmp     r4,r2
    bmi     sortloop

    ldmfd   sp!, {r4-r11, lr}
    bx      lr


THIS IS THE "C" Version of my sort, this is only the 2nd step the 1st one was showed above

   for (i=1;i<ntriangoli;i++)
   {
      for (j=ntriangoli-1;j>=i;j--)
      {
         if (z[j]>z[j-1])
         {
            z1=z[j];
            z[j]=z[j-1];
            z[j-1]=z1;
            t1=triang[j-1];
            triang[j-1]=triang[j];
            triang[j]=t1;
         }
      }
   }



The ASM version is faster then the "C" version but i think not to much, i've tried to make some changes like this:

Code:


    mov     r4,#1           @i=1
sortloop:
    mov     r5,r2,lsl #2           @numtrinagoli in r5
sortloop2:
    sub     r5,r5,#4        @numtriangoli will be -1 tutto this is j
sortloop3:
    cmp     r5,r4           @compare j with i
    bmi     out             @if is < i will go out else >= go on

    ldr     r6,[r1,r5]     
    sub     r5,r5,#4        @j-1
    ldr     r7,[r1,r5]     
    cmp     r6,r7           @if z[j]>z[j-1]
    ble     sortloop3

rest of code some of before



But i don't know is slowly then the other...

Any idea on how to improve ???

To sort,rotate,fill a 86 vertex and 168 triange i take 1 entire cycle...

I know is possible to make it better but really i don't know...

Thanks a lot to DekuTree64 to help me and thanks to anyone who make all the docs i found on the arm asm intructions...

Thanks to anyone in advance...

P.S. Sorry for nomuch comments and sorry for some comment in a very bad english...
_________________
Batblaster / 7 Raven Studios Co. Ltd
------------------------------------------

#18825 - Lupin - Tue Apr 06, 2004 12:23 pm

http://ciips.ee.uwa.edu.au/~morris/Year2/PLDS210/qsort.html
_________________
Team Pokeme
My blog and PM ASM tutorials

#18826 - DekuTree64 - Tue Apr 06, 2004 3:54 pm

Sorry I never got a chance to reply to your mail about this, I've been a little busy lately.
Quicksort would be best, but here's a litle improvement to bubble sort that's easy to do. It should still work looping backward, but I'll stick with the one I know works

Code:
for(i = 0; i < count; i++)
{
   int idx = i;
   for(j = i+1; j < count; j++)
   {
      if(z[j] > z[idx])
         idx=j;
   }
   if(idx != i)
      //swap
}

_________________
___________
The best optimization is to do nothing at all.
Therefore a fully optimized program doesn't exist.
-Deku

#18863 - Torlus - Wed Apr 07, 2004 9:30 am

You should consider Radix sort instead. Simple to code, and much more efficient for the purpose of triangle sorting.
_________________
GBA,GC,NGPC,GP32,FPGA,DS stuff at http://torlus.com/

#18867 - batblaster - Wed Apr 07, 2004 12:45 pm

Thank you to anyone... I will try to make it in asm and hope will be faster hehehe...

If someone doit before me please post here...

Thanks.....
_________________
Batblaster / 7 Raven Studios Co. Ltd
------------------------------------------

#18877 - poslundc - Wed Apr 07, 2004 6:42 pm

Radix sort requires the use of a supplementary sorting algorithm. It is probably best when used in conjunction with counting sort, although the effectiveness of the algorithm will be determined by your available memory.

If you intend to use radix sort, though, make sure your sort-digits are a power of two, and not a power of ten.

Dan (uses insertion sort for almost everything).

#18910 - Miked0801 - Thu Apr 08, 2004 2:01 am

Shell sort is another nice alternative that uses less RAM/Stack the QUick and for smaller data sets, is comperable or faster. It's what I use for sorting our actors.

#18973 - tepples - Thu Apr 08, 2004 10:32 pm

Shell sort: O(n^1.25). I have used Shell sort (decreasing gap insertion sort) for an NES game, and it worked fine. I was even able to implement it in 6502 assembly language in fewer than a half dozen tries.

Heapsort: O(n*log(n)). Approaches the average case efficiency of Quicksort for large data sets.

Both Shell sort and Heapsort can be implemented either iteratively or with only tail recursion. Quicksort requires non-tail recursion and thus requires extra stack space. Quicksort can also degrade to O(n^2) in the worst case if your pivot choices work against you (common on sorted data unless you use middle-of-domain pivot); that won't happen with Shell sort or Heapsort.

See Wikipedia's Quicksort article for details.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.

#20504 - sasq - Wed May 12, 2004 8:47 am

If this is for the purpose of drawing inconvex objects, have you thought about turning the object into a BSP-tree instead? No sorting, perfect rendering with no glitches...

#20513 - Lupin - Wed May 12, 2004 12:08 pm

BSP trees are a good idea, but you will still have to use sorting for models and such... or just use a backbuffer (will be way slower though).
_________________
Team Pokeme
My blog and PM ASM tutorials

#20521 - Daniel Andersen - Wed May 12, 2004 2:07 pm

tepples wrote:
Both Shell sort and Heapsort can be implemented either iteratively or with only tail recursion. Quicksort requires non-tail recursion and thus requires extra stack space.


Are you even sure that GCC supports tail recursion? I'm not. Anyway, it is not always possible for it to implement tail recursion dependent of how you use the local stack frame, e.g. passing on pointers to local variables etc. And since C is (very) weakly typed GCC cannot be sure with any pointers; for instance, passing on an integer in a cal in tail position might just be a pointer into the local stack frame and thus this cannot be destroyed, but there is no way for GCC to tell that.

True, it can be argued that passing on pointers to local variables is a bad habbit, but how does GCC care? ;)

#20872 - keldon - Tue May 18, 2004 9:49 pm

the passing of integers is simply pushing the value of that integer onto the stack .. tail recursion - or any recursion for that matter is fine

Pointers are another issue - it's fine so long as you remember that they're all pointing to the same piece of data