gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback machine copies. A new forum can be found here.

ASM > trying to draw a square

#153284 - yaazz - Thu Mar 27, 2008 6:22 pm

I am trying to write a simple asm program to draw a square on the screen. I am using the idea that Y * W + X gives the correct index, but it doesn't seem to work. Does anyone see an obvious problem with my code?

Code:
 
.arm
.text
.global main

main:
mov r0, #0x4000000
mov r1, #0x400
         
add r1, r1, #3   
            
str r1, [r0]

      
mov r0, #0x6000000  @ address of VRAM
mov r1, #0x6500     @ some pinkish color
add r1, r1, #0x9E

mov r2, #0xF0       @ the width of the screen
mov r3, #0x64       @X=100
mov r4, #0x32       @Y=50
mov r5, #0x32       @W & H = 50




loop1:
   mul r6, r4, r5   @curposition = Y * Width
   add r6, r6, r3   @curposition += X

loop2:
   strh r1, [r0,r6]  @ will store the 16bit value in r1 into address in r0, then
    add r6,r6,#2      @ add 16bits to r6
    subs r5, r5, #0x1 @ subtract 1 from W
    bne loop2         @ if W!=0 goto loop2   
   
    add r1, r1, #0x1  @make the color something else
    mov r5, #0x32     @Reset the W variable
    add r4, r4, #1    @add 1 to Y variable
    cmp r4, #100       @compare Y to 100
    bne loop1         @if not equal goto loop 1
   
infin:                @ an infinite loop
   

b infin

#153289 - Cearn - Thu Mar 27, 2008 8:06 pm

First, you should use the width of the screen to find the scanline addresses, not the width of your rectangle. Second, there is a difference between a bitmap's width and its pitch. The width is the number of pixels per scanline; the pitch is the number of bytes per scanline. The pitch for mode 3 is 480, not 240.

Something similar is true for how you use X: X=100 actually means a byte-offset of 2*100 for 16bit bitmaps.

C accounts for the difference between (u16*) and (u8*) automatically, but assembly doesn't.
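
For comparison, here is a minimal host-side C sketch of the two ways of addressing a mode 3 pixel. The names (`PITCH`, `pixel_byte_offset`, `pixel_index`) are made up for illustration; only the 240x160 size and the 2 bytes per pixel come from the discussion above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SCREEN_W = 240, SCREEN_H = 160, BYTES_PER_PIXEL = 2 };
enum { PITCH = SCREEN_W * BYTES_PER_PIXEL };   /* 480 bytes per scanline */

/* Byte offset of pixel (x, y) from the start of the bitmap.
   This is the number assembly code has to add to the VRAM base. */
static size_t pixel_byte_offset(int x, int y)
{
    assert(x >= 0 && x < SCREEN_W && y >= 0 && y < SCREEN_H);
    return (size_t)y * PITCH + (size_t)x * BYTES_PER_PIXEL;
}

/* The same pixel as an index into a uint16_t array. C multiplies by
   sizeof(uint16_t) for you when the pointer is dereferenced. */
static size_t pixel_index(int x, int y)
{
    assert(x >= 0 && x < SCREEN_W && y >= 0 && y < SCREEN_H);
    return (size_t)y * SCREEN_W + (size_t)x;
}
```

For X=100, Y=50 the index is 12100 and the byte offset is twice that, 24200. The original code used neither: it multiplied Y by the rectangle width and added X unscaled.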

#153293 - Miked0801 - Thu Mar 27, 2008 10:33 pm

You didn't say whether you wanted Thumb or ARM. This is ARM, but it could be converted to Thumb with 2 extra instructions.

Code:

    mov r0, #0x6000000  @ address of VRAM
    mov r1, #0x6500     @ some pinkish color
    add r1, r1, #0x9E     

    mov r2, #0xF0       @ Width of screen in pixels
    mov r3, #0x64       @ X=100 pixels
    mov r4, #0x32       @ Y=50
    mov r5, #0x32       @ Width/height

    mul  r4,r2,r4       @  = y * screen width (r2), not the rect width
    add  r4,r4,r3       @ add in X offset
    add  r0,r0,r4,lsl#1 @ Add to base pointer *2 for width

    sub  r2, r2,r5      @ how many pixels to next row
    mov r4, r5          @ Setup lower loop Height counter

loop1:
    mov r3, r5          @ Setup width counter

loop2:
    strh r1,[r0]!       @ Store pixel and update address
    subs r3,r3,#1       @ loop till done
    bne loop2           @ ...

    add r1,r1,#1        @ Update pixel color
    add r0,r0,r2,lsl#1  @ add in pixel to next row*2 for 16-bits
    subs  r4,r4,#1      @ loop till done
    bne loop1           @ ...



Untested, but it shows how to deal with 16/8-bit conversions on the fly, which your code does not handle. I'm also assuming that your setup values are passed in, so they cannot be used as immediates further down the way you do with your #100. I also use 1 register less, but that matters little.

For real speed, you should draw large runs with stm and handle halfword alignment at the beginning/end. If you knew beforehand that the width of your rect was always 50 pixels, the inner loop would become lightning fast.

Finally, if you are drawing in an 8-bit mode, then the shifts need to swap around a bit, but the general idea still works.
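
The loop structure described here (one start address, then a fixed hop from the end of one row to the start of the next) can be sketched in C as follows. This is a host-side illustration, not GBA code: `fake_vram` and `fill_rect` are hypothetical names, and the array stands in for mode 3 VRAM so the sketch can run on a PC.

```c
#include <assert.h>
#include <stdint.h>

enum { SCREEN_W = 240, SCREEN_H = 160 };

/* Stand-in for mode 3 VRAM so the sketch runs on a PC. */
static uint16_t fake_vram[SCREEN_W * SCREEN_H];

static void fill_rect(uint16_t *vram, int x, int y, int w, int h, uint16_t color)
{
    assert(w > 0 && h > 0 && x >= 0 && y >= 0);
    assert(x + w <= SCREEN_W && y + h <= SCREEN_H);

    uint16_t *dst = vram + y * SCREEN_W + x;   /* top-left pixel, computed once */
    int skip = SCREEN_W - w;                   /* pixels from row end to next row start */

    for (int row = 0; row < h; row++, color++) {   /* new color per row, as in the asm */
        for (int col = 0; col < w; col++)
            *dst++ = color;                    /* store pixel, advance pointer */
        if (row + 1 < h)
            dst += skip;                       /* hop to the start of the next row */
    }
}
```

Because `dst` is a `uint16_t *`, the compiler applies the factor of 2 that the assembly version has to add by hand with `lsl #1`.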

#153324 - yaazz - Fri Mar 28, 2008 3:40 pm

Oops, the problem was that I was using the square width and not the screen width.

Also, is there any way to make variables in asm?

#153326 - tepples - Fri Mar 28, 2008 4:28 pm

In assembly language on x86 or ARM, you move the stack pointer down to make space for local variables.

See how procedure calls work in ARM and Thumb (pdf)
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.

#153411 - yaazz - Sat Mar 29, 2008 9:31 pm

A quick search for ARM instruction sets tells me PUSH and POP can only be used in Thumb mode?
If this is true how do I use the stack in ARM mode?

#153412 - tepples - Sat Mar 29, 2008 9:34 pm

yaazz wrote:
A quick search for ARM instruction sets tells me PUSH and POP can only be used in Thumb mode?
If this is true how do I use the stack in ARM mode?

The ARM equivalents of the PUSH and POP instructions are the STMFD and LDMFD instructions. Please read the PDF to see how.

#153415 - Dwedit - Sat Mar 29, 2008 10:58 pm

example in arm:
stmfd sp!,{r0,r1,r2-r7,lr} @push

ldmfd sp!,{r0,r1,r2-r7,pc} @pop

note that you can't push LR and pop PC if you intend on returning to thumb mode...
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

#153451 - yaazz - Sun Mar 30, 2008 10:37 pm

I read through the procedure call standard, and although I learned a few things I didn't already know, I didn't see any mention of the STMFD and LDMFD instructions... Where should I be looking?

#153479 - Cearn - Mon Mar 31, 2008 5:11 pm

devkitArm accepts push and pop for ARM code as well, but it may be incompatible with other/older assemblers.

There's a very quick run-down of ARM/Thumb assembly in Tonc here. It contains the most important information and covers some GBA-specific items, but the information is densely packed. At least get the documents pointed to at the top; the quick-references in particular are very useful.

Some other, more extensive documents:
http://www.ee.ic.ac.uk/pcheung/teaching/ee2_computing/
http://www.heyrick.co.uk/assembler/
http://www.peter-cockerell.net/aalp/html/frames.html

The document tepples pointed to is somewhat outdated. A more recent version is this one: AAPCS

#153668 - yaazz - Thu Apr 03, 2008 5:59 pm

from TONC

.Lpool:
.word 0x06010000
.word far_var

@ Shorthand: use ldr= and GCC will manage the pool for you
ldr r0,=0x06010000 @ Load a value
ldr r0,=far_var @ Load far_var's address
ldr r0, [r0] @ Load far_var's contents

Is this far_var word just pseudocode? If I try to do this to store a variable, it won't compile.

I'm not really trying to make local variables; I want to make global variables.

#153671 - Cearn - Thu Apr 03, 2008 6:44 pm

The point of that particular note wasn't about how to create variables, but about two methods that you can use to load large values and addresses (like the addresses of variables) into registers.
This code
Code:
    ldr     r0, .Lpool      @ Load a value
    ldr     r0, .Lpool+4    @ Load far_var's address
    ldr     r0, [r0]        @ Load far_var's contents
.Lpool:
    .word   0x06010000
    .word   far_var

and this
Code:
    ldr     r0,=0x06010000  @ Load a value
    ldr     r0,=far_var     @ Load far_var's address
    ldr     r0, [r0]        @ Load far_var's contents

are functional equivalents. The line ".word far_var" doesn't create a far_var variable, it creates a space containing the address of far_var. Sorry if this was unclear.

To create the variable itself, you have to make a label followed by some data declarations. For example, this creates a global halfword called far_var and initializes it with 12:
Code:
@ Equivalent of
@ u16 far_var= 12;

    .data                @ Data section (IWRAM)
    .align 1             @ Alignment for a halfword
    .global far_var      @ Make it global
far_var:
    .hword   12          @ Allocate a halfword containing 12

Creation of variables is covered in more detail in the GNU assembler section. Note that other assemblers (the official ARM assembler, for example) may do things differently.
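
As a rough C analogy (a sketch only; `pool` and `load_far_var` are hypothetical names, not anything the assembler generates): the variable is one object, and the pool entry is a second object that merely holds its address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The variable itself: storage plus an initial value.
   Corresponds to the label far_var followed by ".hword 12". */
uint16_t far_var = 12;

/* Stand-in for the literal pool: a table holding the variable's address.
   Corresponds to ".word far_var", which creates no variable at all. */
static uint16_t *const pool[] = { &far_var };

static uint16_t load_far_var(void)
{
    uint16_t *addr = pool[0];      /* like loading far_var's address */
    assert(addr != NULL);
    return *addr;                  /* like loading far_var's contents */
}
```

Leaving out the first definition while keeping the pool entry is the C equivalent of the link error described above: something refers to `far_var`, but nothing defines it.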

A decent way to get a feel for ARM asm and how the assembler expects things is to look at the compiler-generated assembly. Create some C code and add "-save-temps" to CFLAGS; the corresponding assembly will be in an .s file in the build directory. It may help to remove the -g flag when doing so: the debugging information makes the asm harder to read.

#153681 - yaazz - Thu Apr 03, 2008 10:00 pm

OK, here is my code so far. I'm trying to make the ball bounce around the edges of the screen, but since I added global variables it doesn't work anymore. Does anyone see anything wrong?

Code:




.arm
.text
.global main

main:


mov r0, #0x4000000   @set up video control register
mov r1, #0x400
add r1, r1, #3
str r1, [r0]
@/////////////////////////////////



      
drawScreen:
   ldr r3,=ballx    @load in ball x and ball y
   ldr r3,[r3]

   ldr r4,=bally
   ldr r4,[r4]
   

mov r5, #5       @W = 10
mov r7, #5       @H = 50

loop1:
    ldr r0, =0x6000000  @goto vram
    ldr r1, =0x659E     @Go to pinkish Color
    mov r2, #0x1E0      @width of screen
   

   
    mul r2, r2, r4   @curposition = Y * Width
    add r2, r2, r3   @curposition += X



loop2:
   strh r1, [r0,r2]  @ will store the 16bit value in r1 into address in r0, then
    add r2,r2,#2      @ add 16bits to r6
    subs r5, r5, #0x1 @ subtract 1 from W
        bne loop2         @ if W!=0 goto loop2   

    mov r5, #0x5     @Reset the W variable
    add r4,r4,#1
    subs r7, r7, #1    @add 1 to Y variable
        bne loop1         @if not equal goto loop 1

    mov r7,#5


moveBall:
    ldr r0, =ballx
    ldr r0, [r0]        @load ballx in r0

    ldr r2, =ballSpdx
    ldr r2, [r2]        @load ballspdx in r0

    mov r3,#-1          @Change sign of ball speed if needed
    adds r0,r0,r2       @ballx += ballspeedx
    muleq r2,r3,r2      @if x=0 ballspdx=-x
   
    cmp r0,#480         @if ball=240 * 2bpp
    muleq r2,r3,r2      @ballspdx = -ballspdx
   
    ldr r4, =ballx   @store ball x
    str r0, [r4]

    ldr r4, =ballSpdx @store ball speedx
    str r1, [r4]



clrScreenEntry:
    mov r0,#0x6000000
    mov r1,#0x0
    ldr r2,=38400
clrScreen:
   
    strh r1, [r0]  @ will store the 16bit value in r1 into address in r0, then
    add r0,r0,#2      @ add 16bits to r6
    subs r2, r2, #0x1 @ subtract 1 from r2
        bne clrScreen         @ if W!=0 goto loop2   
            b drawScreen
   
   
@ Global Variables

    .data                @ Data section (IWRAM)
   // .align 1             @ Alignment for a halfword
    .global ballx        @ Make it global
ballx:
    .word   10          @ Allocate a halfword containing 12
   

                   @ Data section (IWRAM)
    //.align 1             @ Alignment for a halfword
    .global bally      @ Make it global
bally:
    .word   10          @ Allocate a halfword containing 12
   


                  @ Data section (IWRAM)
   // .align 1             @ Alignment for a halfword
    .global ballSpdx      @ Make it global
ballSpdx:
    .word   3          @ Allocate a halfword containing 12
   

                    @ Data section (IWRAM)
    //.align 1             @ Alignment for a halfword
    .global ballSpdy      @ Make it global
ballSpdy:
    .word   3          @ Allocate a halfword containing 12
   


#153694 - Dwedit - Fri Apr 04, 2008 12:35 am

.word does not allocate a halfword

#153727 - yaazz - Fri Apr 04, 2008 4:10 pm

Oops, I should have changed that comment. They are supposed to be words; I copied and pasted from Cearn's code above but changed .hword to .word.

#153733 - Miked0801 - Fri Apr 04, 2008 7:09 pm

Ok, I speak for all experienced coders here when I say:

Comments are as important as code - even more so when dealing in assembly language. PLEASE, take the time to go through your code and clean up the comments - especially before asking strangers for help. Once you do that, you'll get some more insightful responses.

#153864 - Cearn - Mon Apr 07, 2008 3:13 pm

yaazz wrote:
OK, here is my code so far. I'm trying to make the ball bounce around the edges of the screen, but since I added global variables it doesn't work anymore. Does anyone see anything wrong?

Code:
<<stuff>>

The code is mostly functional, but there are some small errors and one large efficiency problem that makes it hard to see anything. (And then there are the comments, as Miked0801 said. If the comments and code disagree, they're both wrong.)

First, the reason that the ball doesn't bounce.
yaazz wrote:
Code:
   ldr r0, =ballx
   ldr r0, [r0]      @load ballx in r0

   ldr r2, =ballSpdx
   ldr r2, [r2]      @load ballspdx in r0

   mov r3,#-1         @Change sign of ball speed if needed
   adds r0,r0,r2      @ballx += ballspeedx
   muleq r2,r3,r2      @if x=0 ballspdx=-x
   
   cmp r0,#480       @if ball=240 * 2bpp
   muleq r2,r3,r2      @ballspdx = -ballspdx
   
   ldr r4, =ballx    @store ball x
   str r0, [r4]

   ldr r4, =ballSpdx @store ball speedx
   str r1, [r4]

First, you're checking for equalities here ("muleq"). Since 10+k*3 will never be 0 or 480, the test always fails. What you need is 'lower than' (lt) and 'greater than/equal to' (ge). You're also writing the wrong register back to ballSpdx (should be r2, not r1). I'll assume you're not handling y-changes because you're saving that for later.
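
A small host-side C sketch of the difference, using hypothetical helper names. `equality_test_ever_hits` walks the same positions the asm does (start 10, step 3, limit 480 in byte units) and reports whether an equality test could ever fire; `bounce_dx` is the range test that should be used instead, written here in pixel units.

```c
#include <assert.h>

enum { SCREEN_W = 240 };

/* Returns 1 if (x == 0 || x == limit) is ever true while x walks
   from start towards limit in steps of dx; 0 otherwise. */
static int equality_test_ever_hits(int start, int dx, int limit)
{
    assert(dx > 0);
    for (int x = start; x <= limit; x += dx)
        if (x == 0 || x == limit)
            return 1;
    return 0;
}

/* Range test: flip the speed whenever the new position is off-screen.
   Like the thread's code, this reverses direction but does not clamp x. */
static int bounce_dx(int new_x, int dx)
{
    return (new_x < 0 || new_x >= SCREEN_W) ? -dx : dx;
}
```

With a start of 10 and a step of 3 the position jumps from 478 to 481, so the equality test never sees 480, while the range test still catches the overshoot.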

A second problem is that there's no timing signal to ensure a consistent frame rate (wait for VBlank, for example). Without this, the CPU will just blaze through the code as fast as possible. Now, it will seem like there is some sort of timing because of your method of clearing the screen:
yaazz wrote:
Code:

clrScreen:
    strh r1, [r0]           @ Nc + Nd
    add r0,r0,#2            @ S
    subs r2, r2, #0x1       @ S
    bne clrScreen           @ 2S + Nc

    @ Total : 2 Nc + Nd + 4S = 2*8 + 1 + 4*6 = 41

This loop is in ROM, as ARM code, filling one halfword at a time. This is the worst combination of factors for VRAM clearing you can devise without actually trying. Each iteration of this loop takes 41 cycles, for a total of 41*240*160 = 1.57 Mcycles, or 5.6 frames. In other words, the vast majority of time in your code is currently spent erasing everything. Halfword fills are bad, very bad, here.
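
The arithmetic behind those figures, as a small C sketch. The 41 cycles per iteration is taken from the post as given; the 280896 cycles per frame (228 scanlines of 1232 cycles each) is the standard GBA frame length and is the one number added here. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum {
    CYCLES_PER_ITER  = 41,          /* per-halfword cost quoted in the post */
    HALFWORDS        = 240 * 160,   /* one halfword per mode 3 pixel */
    CYCLES_PER_FRAME = 280896       /* 228 scanlines * 1232 cycles */
};

/* Total cycles spent in the halfword clear loop. */
static uint32_t clear_cycles(void)
{
    assert(CYCLES_PER_FRAME == 228 * 1232);
    return (uint32_t)CYCLES_PER_ITER * HALFWORDS;
}

/* Frames spent clearing, in tenths of a frame, rounded down. */
static uint32_t clear_frames_x10(void)
{
    return clear_cycles() * 10u / CYCLES_PER_FRAME;
}
```

That works out to 1574400 cycles, or 5.6 frames, so the ball gets redrawn roughly ten times per second at best.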

There are many, many ways of cutting down on the fill time. For example, using word-fills would cut that time in half already. For the best result, use CpuFastSet (swi 12). This uses octuple stmia's and the routine is in faster memory (BIOS) to boot. CpuFastSet takes 1.3 cycles per halfword, giving a speed-up of about 30x.
CpuFastSet does have some alignment and size requirements though (see GBATek:CpuFastSet). For general 16-bit fills, consider using this. This routine should beat manual loops after about 6 halfwords or so.
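
The word-fill idea on its own can be sketched in C like this. It is a host-side illustration with hypothetical names (`fake_vram_words`, `fill16_by_words`), and it shows only the "two pixels per store" step, not CpuFastSet itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for mode 3 VRAM viewed as 32-bit words (240*160/2 of them). */
static uint32_t fake_vram_words[240 * 160 / 2];

/* Fill `count` halfwords with `color`, writing two per store.
   Assumes dst is 32-bit aligned and count is even. */
static void fill16_by_words(uint32_t *dst, uint16_t color, size_t count)
{
    assert(count % 2 == 0);
    uint32_t pair = ((uint32_t)color << 16) | color;   /* two pixels in one word */
    for (size_t i = 0; i < count / 2; i++)
        dst[i] = pair;
}
```

The loop body runs 19200 times instead of 38400, which is where the "cut that time in half" estimate comes from.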

There are other, smaller efficiency issues as well. For example, negating a sign can be done without multiplication. There is an RSB instruction for a Reverse SuBtraction. "rsb r0, r1, #0" equates to "r0 = 0 - r1", or "r0= -r1".

You should also try to limit memory accesses as much as possible. That means not loading each variable's address separately, using relative addressing from a base register, and keeping relevant data in registers as much as possible. For example, you have ballx, bally, ballSpdx and ballSpdy, and these items are adjacent in memory. So once you know one address, you have all of them. You can even use ldm instead of ldr to load multiple variables in one go.
Code:

    @ Load address of ballx
    ldr     r0,=ballx
   
    @ Load ballx and bally via relative addressing
    ldr     r4, [r0]
    ldr     r5, [r0, #4]

    @ Load ballx and bally via ldmia. Same as above, but faster.
    ldmia   r0, {r4-r5}
   
    ...
ballx:
    .word   10
bally:
    .word   10

This is actually how C structs are implemented as well. There is a label to some part of memory, followed by the different members. To load a member, you get the base pointer/address and use relative addressing to get the member you want, whether it's a byte, word, or anything.
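
A minimal C sketch of that correspondence. The `Ball` struct, the `ball` instance and `load_word_at` are illustrative names; the four 32-bit members mirror the four adjacent `.word` variables from the earlier listing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C view of the four adjacent .word variables. */
typedef struct Ball {
    int32_t x;     /* ballx    at offset  0 */
    int32_t y;     /* bally    at offset  4 */
    int32_t dx;    /* ballSpdx at offset  8 */
    int32_t dy;    /* ballSpdy at offset 12 */
} Ball;

static Ball ball = { 10, 10, 3, 3 };

/* Read a member the way "ldr r5, [r0, #4]" does:
   base address plus a byte offset. */
static int32_t load_word_at(const Ball *base, size_t byte_offset)
{
    assert(byte_offset % 4 == 0 && byte_offset < sizeof(Ball));
    int32_t value;
    memcpy(&value, (const unsigned char *)base + byte_offset, sizeof value);
    return value;
}
```

Writing `ball.y` in C and calling `load_word_at(&ball, 4)` read the same four bytes; the compiler simply knows the offset from the struct declaration.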

And then there's rendering of the ball itself. Your rendering loop is twice the size that it should be, and there should be no multiplications or memory loads inside it. If you have data that doesn't actually change inside the loops (i.e., it's "loop invariant"), take it out of the loop! Where possible, also replace multiplies by incremental additions. This is exactly what Miked0801's code did, so use that as an example. You will have to change "strh r1,[r0]!" to "strh r1,[r0], #2". Quick addressing/write-back overview:
Code:
@ Pre-indexed addressing
ldr     Rd, [Rm, #4]    @ Rd = *(u32*)(Rm+4)

@ Pre-indexed, with writeback
ldr     Rd, [Rm, #4]!   @ Rd = *(u32*)(Rm+4)  ; Rm += 4

@ Post-indexed with writeback
ldr     Rd, [Rm], #4    @ Rd = *(u32*)(Rm)    ; Rm += 4

As you can see, updating pointers doesn't need a separate instruction in ARM; you can use write-backs. So using "strh r3, [r0]; add r0, r0, #2" can be done in one instruction: "strh r3, [r0], #2".
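
The three addressing forms have close C counterparts, sketched below. The global pointer `rm` plays the role of register Rm and `demo_words` is made-up data; both are there only so the sketch is runnable on a PC.

```c
#include <assert.h>
#include <stdint.h>

static const uint32_t demo_words[4] = { 11, 22, 33, 44 };
static const uint32_t *rm = demo_words;    /* plays the role of register Rm */

/* ldr Rd, [Rm, #4]   : read one word past Rm, Rm unchanged */
static uint32_t load_pre(void)
{
    assert(rm + 1 < demo_words + 4);       /* stay inside the demo array */
    return rm[1];
}

/* ldr Rd, [Rm, #4]!  : advance Rm first, then read through it */
static uint32_t load_pre_writeback(void)
{
    assert(rm + 1 < demo_words + 4);
    return *++rm;
}

/* ldr Rd, [Rm], #4   : read through Rm, then advance it */
static uint32_t load_post_writeback(void)
{
    assert(rm < demo_words + 4);
    return *rm++;
}
```

The post-indexed form is the familiar `*p++` idiom, which is why a C pixel loop written that way can compile down to a single store with write-back.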

The code below is pretty much what it ought to be, including stacking for used registers so that it can be safely called from C code. There are some sneaky bits, but I hope most of it is understandable. Note that there is a pre-processor switch for vsynching with interrupts, because I'm not sure you have interrupts enabled right now. Also, the collision detection is still flawed. Correcting it has been left as an exercise for the reader.

Code:
    .text
    .arm
    .align
    .global bouncer

bouncer:
    @ Save clobbered registers (r0-r3,ip are free)
    stmfd   sp!, {r4-r7, lr}

    mov r0, #0x4000000   @ Set up video control register
    mov r1, #0x400
    add r1, r1, #3
    str r1, [r0]

    @ --- This is the main loop ---
.LmainLoop:
   
    @ --- Register list ---
    @ r0 : sourcy stuff
    @ r1 : desty stuff
    @ r2 : countery stuff
    @ r3 : datay stuff / temp variable
    @ r4 : ball.x
    @ r5 : ball.y
    @ r6 : ball.w; later ball.dx
    @ r7 : ball.h; later ball.dy
    @ ip : &ball

    ldr     ip,=ball            @ load ball
    ldmia   ip!, {r4-r5}        @ ball.x, ball.y

    mov     r6, #5              @ ball.w
    mov     r7, #5              @ ball.h

    @ --- Rendering the ball ---

    @# NOTE: do as much outside of the loops as possible
    @# The following instructions are preparation steps that
    @# represent 'loop invariant' quantities: things that
    @# don't change inside the loop.
    @#
    @# * Get a pointer to a corner of the destination
    @#   (in this case the top-left).
    @# * Instead of multiplying y*pitch, use incremental
    @#   offsets : dst += pitch
    @# * That said, as the inner loop increments the dst pointer
    @#   the distance to the next scanline is ball.w less
    @#   than the actual pitch. This can be calculated outside the
    @#   loop as well.

    ldr     r0,=0x659E          @ Pinkish color

    @ u16 *dst = &vram_16[y*240+x] = VRAM + y*(512-32) + x*2
    ldr     r1,=0x6000000
    add     r1, r1, r5, lsl #9
    sub     r1, r1, r5, lsl #5
    add     r1, r1, r4, lsl #1

    rsb     r3, r6, #240        @ screen.w - ball.w

.LloopY:
        mov     r2, r6          @ Reload ball.w
.LloopX:
            strh    r0, [r1], #2
            subs    r2, r2, #1
            bne     .LloopX

        add     r1, r1, r3, lsl #1
        subs    r7, r7, #1
        bne     .LloopY

    @ --- Ball movement and collision detection (faulty) ---

    ldmia   ip!, {r6-r7}        @ Load ball.dx, ball.dy

    @ x += dx; if(x<0 || x>=240) dx= -dx;
    adds    r4, r4, r6
    rsblt   r6, r6, #0
    cmp     r4, #240
    rsbge   r6, r6, #0

    @ y += dy; if(y<0 || y>=160) dy= -dy;
    adds    r5, r5, r7
    rsblt   r7, r7, #0
    cmp     r5, #160
    rsbge   r7, r7, #0

    stmdb   ip!, {r4-r7}        @ Save x, y, dx, dy

    @ --- VBlank synchronization ---
    @# NOTE : you always need a timing signal for consistent
    @# framerates. This is usually the VBlank

#if (I_CAN_HAS_INTERRUPTS==1)

    @ VBlankIntrWait. Only use if interrupts work properly
    swi     #0x50000

#else

    @ V-synch via REG_VCOUNT
    mov     r0, #0x04000000
.LwaitForVDraw:
        ldrh    r3, [r0, #6]
        cmp     r3, #160
        bge     .LwaitForVDraw
               
.LwaitForVBlank:
        ldrh    r3, [r0, #6]
        cmp     r3, #160
        blt     .LwaitForVBlank

#endif

    @ --- Clear screen with CpuFastSet ---
    @# CpuFastSet is the fastest way to fill large chunks of
    @# memory, but comes with requirements. Look it up in GBATek.

    mov     r0, #0
    stmfd   sp!, {r0}           @ Reserve a place for zero on the stack.
    mov     r0, sp              @ Source-ptr
    mov     r1, #0x06000000     @ Destination
    ldr     r2,=0x01004B00      @ Count + fill-bit: 240*160/2 + BIT(24)
    swi     #0xC0000            @ Clear screen
    add     sp, sp, #4          @ 'pop' zero off again

    b .LmainLoop

    @ Not reached: the main loop above never exits
    ldmfd   sp!, {r4-r7, lr}    @ Restore used registers
    bx  lr
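
One of the sneaky bits in the listing is the destination address: y*480 is computed as y*512 - y*32 with two shifted operands instead of a mul. A host-side C sketch (hypothetical function names) that checks the identity over the whole screen:

```c
#include <assert.h>
#include <stdint.h>

enum { SCREEN_W = 240, SCREEN_H = 160 };

/* Byte offset of pixel (x, y): y*480 + x*2, with the multiply by 480
   written as (y << 9) - (y << 5), i.e. y*512 - y*32. */
static uint32_t offset_by_shifts(uint32_t x, uint32_t y)
{
    assert(x < SCREEN_W && y < SCREEN_H);
    return (y << 9) - (y << 5) + (x << 1);
}

/* Count the (x, y) pairs where the shift form disagrees with plain
   multiplication. */
static unsigned mismatches(void)
{
    unsigned bad = 0;
    for (uint32_t y = 0; y < SCREEN_H; y++)
        for (uint32_t x = 0; x < SCREEN_W; x++)
            if (offset_by_shifts(x, y) != y * 480u + x * 2u)
                bad++;
    return bad;
}
```

The trick works for any multiplier that is a sum or difference of two powers of two, which is why it maps so neatly onto ARM's shifted second operand.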

C driver if necessary:
Code:

void bouncer();

int main()
{
    //# Set-up interrupts and whatever

    bouncer();
    return 0;
}

Many of the optimizations I've used here could already be done in C. Coding in assembly should only be done when you can write code that's faster than what the compiler can create, but you can only do that effectively if you already know some of the basic techniques. It may help to learn more about common optimization techniques before writing your own asm routines. The various loop optimizations are especially useful here.

Also look at the compiler-generated assembly. It already knows many clever tricks that you might not, and has the benefit of writing functionally correct code (if it didn't, people would get very cranky). Having said that, it does still do a few silly things, most notably in loops. These things tend to be obvious, though.