gbadev.org forum archive

I am trying to write a simple asm program to draw a square on the screen. I am using the idea that Y * W + X gives the correct index, but it doesn't seem to work. Does anyone see any obvious problem with my code?

Code:

.arm
.text
.global main

main:
mov r0, #0x4000000
mov r1, #0x400

add r1, r1, #3

str r1, [r0]


mov r0, #0x6000000 @ address of VRAM
mov r1, #0x6500 @ some pinkish color
add r1, r1, #0x9E

mov r2, #0xF0 @ the width of the screen
mov r3, #0x64 @X=100
mov r4, #0x32 @Y=50
mov r5, #0x32 @W & H = 50

loop1:
mul r6, r4, r5 @curposition = Y * Width
add r6, r6, r3 @curposition += X

loop2:
strh r1, [r0,r6] @ will store the 16bit value in r1 into address in r0, then
add r6,r6,#2 @ add 16bits to r6
subs r5, r5, #0x1 @ subtract 1 from W
bne loop2 @ if W!=0 goto loop2

add r1, r1, #0x1 @make the color something else
mov r5, #0x32 @Reset the W variable
add r4, r4, #1 @add 1 to Y variable
cmp r4, #100 @compare Y to 100
bne loop1 @if not equal goto loop 1

infin: @ an infinite loop

b infin

First, you should use the width of the screen to find the scanline addresses, not the width of your rectangle. Second, there is a difference between a bitmap's width and its pitch. The width is the number of pixels per scanline; the pitch is the number of bytes per scanline. The pitch for mode 3 is 480, not 240.

Something similar is true for how you use X: X=100 actually means a byte offset of 2*100 for 16-bit bitmaps.

C accounts for the difference between (u16*) and (u8*) automatically, but assembly doesn't.
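To make the width/pitch distinction concrete, here is a minimal C sketch of the two calculations for mode 3. The function and constant names are made up for illustration; only the numbers (240 pixels, 2 bytes per pixel, pitch 480) come from the explanation above.

```c
#include <stdint.h>

/* Mode 3 geometry: 240 pixels per scanline, 2 bytes per pixel. */
enum { M3_WIDTH = 240, M3_BYTES_PER_PIXEL = 2,
       M3_PITCH = M3_WIDTH * M3_BYTES_PER_PIXEL };   /* 480 bytes */

/* Index into a u16 array: what C computes for vram16[y*240 + x]. */
uint32_t m3_pixel_index(uint32_t x, uint32_t y)
{
    return y * M3_WIDTH + x;
}

/* Byte offset from the start of VRAM: what an assembly strh needs. */
uint32_t m3_byte_offset(uint32_t x, uint32_t y)
{
    return y * M3_PITCH + x * M3_BYTES_PER_PIXEL;
}
```

For X=100, Y=50 the pixel index is 12100 and the byte offset is 24200, exactly twice as large. In C the factor of two is hidden in the pointer type; in assembly you have to apply it yourself.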

You didn't say if you wanted Thumb or Arm. This is ARM, but could be converted to thumb with 2 extra instructions.

Code:

mov r0, #0x6000000 @ address of VRAM
mov r1, #0x6500 @ some pinkish color
add r1, r1, #0x9E

mov r2, #0xF0 @ Width of screen in pixels
mov r3, #0x64 @ X=100 pixels
mov r4, #0x32 @ Y=50
mov r5, #0x32 @ Width/height

mul r4,r2,r4 @ = y * screen width (Rd and Rm must differ on ARM7)
add r4,r4,r3 @ add in X offset
add r0,r0,r4,lsl#1 @ Add to base pointer *2 for 16-bit pixels

sub r2, r2,r5 @ how many pixels to next row
mov r4, r5 @ Setup lower loop Height counter

loop1:
mov r3, r5 @ Setup width counter

loop2:
strh r1,[r0]! @ Store pixel and update address
subs r3,r3,#1 @ loop till done
bne loop2 @ ...

add r1,r1,#1 @ Update pixel color
add r0,r0,r2,lsl#1 @ add in pixel to next row*2 for 16-bits
subs r4,r4,#1 @ loop till done
bne loop1 @ ...

Untested, but it shows how to deal with 16/8 bit conversions on the fly, which your code does not handle. I'm also assuming that your setup values are passed in and the values cannot be used as immediates below like you are doing with your #100. I also use 1 register less, but that matters little.

For real speed, you should draw large runs with stm, handling halfword alignment at the beginning and end of each run. If you knew beforehand that your rect was always 50 pixels wide, the inner loop could be made lightning fast.

Finally, if you are drawing in an 8-bit mode, then the shifts need to swap around a bit, but the general idea still works.
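For comparison, the same idea can be sketched in C: compute the start position once, then step through the rectangle and skip the rest of the scanline after each row. This is only an illustration with made-up names (fill_rect16, screen_w), not a translation of the asm above; it uses an index instead of a pointer so it can also be tried on an ordinary array.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill a w*h rectangle at (x, y) in a 16-bit framebuffer.
   screen_w is the framebuffer width in pixels (240 for mode 3). */
void fill_rect16(uint16_t *fb, size_t screen_w,
                 size_t x, size_t y, size_t w, size_t h, uint16_t color)
{
    size_t pos  = y * screen_w + x;   /* one multiply, outside the loops */
    size_t skip = screen_w - w;       /* pixels from row end to next row start */

    for (size_t row = 0; row < h; row++) {
        for (size_t col = 0; col < w; col++)
            fb[pos++] = color;        /* store, then advance one pixel */
        pos += skip;                  /* jump to the start of the next row */
    }
}
```

Because fb is a u16 pointer, the compiler does the *2 scaling that the asm version does with lsl #1.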

oops the problem was i was using the square width and not the screen width.

Also, is there any way to make variables in asm?

In assembly language on x86 or ARM, you move the stack pointer down to make space for local variables.

See how procedure calls work in ARM and Thumb (pdf)
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.

A quick search for ARM instruction sets tells me PUSH and POP can only be used in Thumb mode?
If this is true how do I use the stack in ARM mode?

yaazz wrote:

A quick search for ARM instruction sets tells me PUSH and POP can only be used in Thumb mode?
If this is true how do I use the stack in ARM mode?

The ARM equivalents of the PUSH and POP instructions are the STMFD and LDMFD instructions. Please read the PDF to see how.

example in arm:
stmfd sp!,{r0,r1,r2-r7,lr} @push

ldmfd sp!,{r0,r1,r2-r7,pc} @pop

note that you can't push LR and pop PC if you intend on returning to thumb mode...
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."

I read through the procedure call standard, and although I learned a few things I didn't already know, I didn't see any mention of the STMFD and LDMFD instructions... Where should I be looking?

devkitARM accepts push and pop for ARM code as well, but they may be incompatible with other/older assemblers.

There's a very quick run-down of ARM/Thumb assembly in Tonc here. It contains the most important information and covers some GBA-specific items, but the information is densely packed. At least get the documents pointed to at the top; the quick-references in particular are very useful.

Some other, more extensive documents:
http://www.ee.ic.ac.uk/pcheung/teaching/ee2_computing/
http://www.heyrick.co.uk/assembler/
http://www.peter-cockerell.net/aalp/html/frames.html

The document tepples pointed to is somewhat outdated. A more recent version is this one: AAPCS

from TONC

.Lpool:
.word 0x06010000
.word far_var

@ Shorthand: use ldr= and GCC will manage the pool for you
ldr r0,=0x06010000 @ Load a value
ldr r0,=far_var @ Load far_var's address
ldr r0, [r0] @ Load far_var's contents

Is this far_var word just pseudocode? Because if I try to do this to store a variable, it won't compile.

I'm not really trying to make local variables, I want to make global variables.

The point of that particular note wasn't about how to create variables, but about two methods that you can use to load large values and addresses (like the addresses of variables) into registers.
This code

Code:

ldr r0, .Lpool @ Load a value
ldr r0, .Lpool+4 @ Load far_var's address
ldr r0, [r0] @ Load far_var's contents
.Lpool:
.word 0x06010000
.word far_var

and this

Code:

ldr r0,=0x06010000 @ Load a value
ldr r0,=far_var @ Load far_var's address
ldr r0, [r0] @ Load far_var's contents

are functional equivalents. The line ".word far_var" doesn't create a far_var variable, it creates a space containing the address of far_var. Sorry if this was unclear.

To create the variable itself, you have to make a label followed by some data declarations. For example, this creates a global halfword called far_var and initializes it with 12:

Code:

@ Equivalent of
@ u16 far_var= 12;

.data @ Data section (IWRAM)
.align 1 @ Alignment for a halfword
.global far_var @ Make it global
far_var:
.hword 12 @ Allocate a halfword containing 12

Creation of variables is covered in more detail in the GNU assembler section. Note that other assemblers (the official ARM assembler, for example) may do things differently.

A decent way to get a feel for ARM asm and how the assembler expects things is to look at the compiler-generated assembly. Create some C code and add "-save-temps" to CFLAGS; the corresponding assembly will be in an .s file in the build directory. It may help to remove the -g flag when doing so: the debugging information makes the asm harder to read.

Ok here is my code so far. I'm trying to make the ball bounce around the edges of the screen, but since I have added global variables, it doesn't work anymore. Does anyone see anything wrong?

Code:

.arm
.text
.global main

main:

mov r0, #0x4000000 @set up video control register
mov r1, #0x400
add r1, r1, #3
str r1, [r0]
@/////////////////////////////////

drawScreen:
ldr r3,=ballx @load in ball x and ball y
ldr r3,[r3]

ldr r4,=bally
ldr r4,[r4]

mov r5, #5 @W = 10
mov r7, #5 @H = 50

loop1:
ldr r0, =0x6000000 @goto vram
ldr r1, =0x659E @Go to pinkish Color
mov r2, #0x1E0 @width of screen

mul r2, r2, r4 @curposition = Y * Width
add r2, r2, r3 @curposition += X

loop2:
strh r1, [r0,r2] @ will store the 16bit value in r1 into address in r0, then
add r2,r2,#2 @ add 16bits to r6
subs r5, r5, #0x1 @ subtract 1 from W
bne loop2 @ if W!=0 goto loop2

mov r5, #0x5 @Reset the W variable
add r4,r4,#1
subs r7, r7, #1 @add 1 to Y variable
bne loop1 @if not equal goto loop 1

mov r7,#5

moveBall:
ldr r0, =ballx
ldr r0, [r0] @load ballx in r0

ldr r2, =ballSpdx
ldr r2, [r2] @load ballspdx in r0

mov r3,#-1 @Change sign of ball speed if needed
adds r0,r0,r2 @ballx += ballspeedx
muleq r2,r3,r2 @if x=0 ballspdx=-x

cmp r0,#480 @if ball=240 * 2bpp
muleq r2,r3,r2 @ballspdx = -ballspdx

ldr r4, =ballx @store ball x
str r0, [r4]

ldr r4, =ballSpdx @store ball speedx
str r1, [r4]

clrScreenEntry:
mov r0,#0x6000000
mov r1,#0x0
ldr r2,=38400
clrScreen:

strh r1, [r0] @ will store the 16bit value in r1 into address in r0, then
add r0,r0,#2 @ add 16bits to r6
subs r2, r2, #0x1 @ subtract 1 from r2
bne clrScreen @ if W!=0 goto loop2
b drawScreen

@ Global Variables

.data @ Data section (IWRAM)
// .align 1 @ Alignment for a halfword
.global ballx @ Make it global
ballx:
.word 10 @ Allocate a halfword containing 12

@ Data section (IWRAM)
//.align 1 @ Alignment for a halfword
.global bally @ Make it global
bally:
.word 10 @ Allocate a halfword containing 12

@ Data section (IWRAM)
// .align 1 @ Alignment for a halfword
.global ballSpdx @ Make it global
ballSpdx:
.word 3 @ Allocate a halfword containing 12

@ Data section (IWRAM)
//.align 1 @ Alignment for a halfword
.global ballSpdy @ Make it global
ballSpdy:
.word 3 @ Allocate a halfword containing 12

.word does not allocate a halfword

Ooops, I should have changed that comment. They are supposed to be words; I copied and pasted from Cearn's code above but changed .hword to .word.

Ok, I speak for all experienced coders here when I say:

Comments are as important as code - even more so when dealing in assembly language. PLEASE, take the time to go through your code and clean up the comments - especially before asking strangers for help. Once you do that, you'll get some more insightful responses.

yaazz wrote:

Ok here is my code so far, Im trying to make the ball bounce around the edges of the screen, but since i have added global variables, it doesnt work anymore, does anyone see anything wrong?

Code:

<<stuff>>

The code is mostly functional, but there are some small errors and one large efficiency problem that makes it hard to see anything. (And then there's the comments, as Miked0801 said. If the comments and code disagree, they're both wrong.)

First, the reason that the balls don't bounce.

yaazz wrote:

Code:

ldr r0, =ballx
ldr r0, [r0]    @load ballx in r0

ldr r2, =ballSpdx
ldr r2, [r2]    @load ballspdx in r0

mov r3,#-1       @Change sign of ball speed if needed
adds r0,r0,r2    @ballx += ballspeedx
muleq r2,r3,r2    @if x=0 ballspdx=-x

cmp r0,#480    @if ball=240 * 2bpp
muleq r2,r3,r2    @ballspdx = -ballspdx

ldr r4, =ballx @store ball x
str r0, [r4]

ldr r4, =ballSpdx @store ball speedx
str r1, [r4]

First, you're checking for equalities here ("muleq"). Since 10+k*3 will never be 0 or 480, the test always fails. What you need is 'lower than' (lt) and 'greater than/equal to' (ge). You're also writing the wrong register back to ballSpdx (should be r2, not r1). I'll assume you're not handling y-changes because you're saving that for later.
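A small C sketch of why the equality test never fires and what a range test looks like. The names (bounce_step, ever_equals) are made up for illustration, and it works in pixel coordinates with a single limit rather than your byte offsets. Note that the position can still overshoot the edge by one step before turning around; clamping is left out here.

```c
/* Advance one axis by its velocity and flip the velocity when the new
   position leaves [0, limit). Returns the new position. */
int bounce_step(int pos, int *vel, int limit)
{
    pos += *vel;
    if (pos < 0 || pos >= limit)
        *vel = -*vel;
    return pos;
}

/* Returns 1 if start + k*step ever equals target for 0 <= k < steps. */
int ever_equals(int start, int step, int target, int steps)
{
    for (int k = 0; k < steps; k++)
        if (start + k * step == target)
            return 1;
    return 0;
}
```

Starting at 10 with a step of 3, ever_equals() reports that neither 0 nor 480 is ever hit, which is why the muleq lines never run. With the range test, a ball at 238 moving by +3 lands on 241 and has its velocity flipped to -3.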

A second problem is that there's no timing signal to ensure a consistent frame rate (wait for VBlank, for example). Without this, the CPU will just blaze through the code as fast as possible. Now, it will seem like there is some sort of timing because of your method of clearing the screen:

yaazz wrote:

Code:

clrScreen:
strh r1, [r0] @ Nc + Nd
add r0,r0,#2 @ S
subs r2, r2, #0x1 @ S
bne clrScreen @ 2S + Nc

@ Total : 2 Nc + Nd + 4S = 2*8 + 1 + 4*6 = 41

This loop is in ROM, as ARM code, filling one halfword at a time. This is the worst combination of factors for VRAM clearing you can devise without actually trying. Each iteration of this loop takes 41 cycles, for a total of 41*240*160 = 1.57 Mcycles, or 5.6 frames. In other words, the vast majority of time in your code is currently spent erasing everything. Halfword fills are bad, very bad, here.
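The arithmetic behind those numbers, written out as a C sketch. The waitstate figures (8, 6 and 1) are the ones from the cycle count above; the frame length of 280896 cycles is 228 scanlines times 1232 cycles. The names are made up for illustration.

```c
#include <stdint.h>

enum {
    ROM_N_CYCLES     = 8,       /* non-sequential ARM fetch from ROM (Nc) */
    ROM_S_CYCLES     = 6,       /* sequential ARM fetch from ROM (S) */
    VRAM_N_CYCLES    = 1,       /* halfword store to VRAM (Nd) */
    CYCLES_PER_FRAME = 280896   /* 228 scanlines * 1232 cycles */
};

/* Cost of one strh/add/subs/bne iteration: 2 Nc + Nd + 4 S. */
uint32_t clear_loop_cycles_per_iteration(void)
{
    return 2 * ROM_N_CYCLES + VRAM_N_CYCLES + 4 * ROM_S_CYCLES;
}

/* Cost of clearing `halfwords` pixels one halfword at a time. */
uint32_t clear_loop_total_cycles(uint32_t halfwords)
{
    return clear_loop_cycles_per_iteration() * halfwords;
}
```

For 240*160 halfwords this gives 41 * 38400 = 1574400 cycles, and 1574400 / 280896 is roughly 5.6 frames.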

There are many, many ways of cutting down on the fill time. For example, using word-fills would cut that time in half already. For the best result, use CpuFastSet (swi 12). This uses octuple stmia's and the routine is in faster memory (BIOS) to boot. CpuFastSet takes 1.3 cycles per halfword, giving a speed-up of about 30x.
CpuFastSet does have some alignment and size requirements though (see GBATek:CpuFastSet). For general 16-bit fills, consider using this. This routine should beat manual loops after about 6 halfwords or so.
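As a rough sketch of the size requirement: CpuFastSet takes a count in words plus a fill flag in bit 24, and it works in blocks of 8 words. The helper below is hypothetical (it is not a BIOS or library function) and simply refuses sizes that are not a whole number of 8-word blocks, which is one cautious way to handle it.

```c
#include <stdint.h>

#define CFS_FILL_BIT (1u << 24)   /* fixed source address: fill, not copy */

/* Build the r2 argument for a CpuFastSet fill of `halfwords` 16-bit units.
   Returns 0 if the size is not a multiple of 8 words (16 halfwords). */
uint32_t cpufastset_fill_arg(uint32_t halfwords)
{
    if (halfwords % 16 != 0)
        return 0;
    return (halfwords / 2) | CFS_FILL_BIT;
}
```

A full mode 3 screen is 240*160 = 38400 halfwords, or 19200 words, so the argument comes out as 0x01004B00. Check GBATek for the exact rules before relying on this.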

There are other, smaller efficiency issues as well. For example, negating a sign can be done without multiplication. There is an RSB instruction for a Reverse SuBtraction. "rsb r0, r1, #0" equates to "r0 = 0 - r1", or "r0= -r1".

You should also try to limit memory accesses as much as possible. That means not reloading the variable addresses, using relative addressing from a base register, and keeping relevant data in registers as much as possible. For example, you have ballx, bally, ballSpdx and ballSpdy, and these items are adjacent in memory. So once you know one address, you have all of them. You can even use ldm instead of ldr to load multiple variables in one go.

Code:

@ Load address of ballx
ldr r0,=ballx

@ Load ballx and bally via relative addressing
ldr r4, [r0]
ldr r5, [r0, #4]

@ Load ballx and bally via ldmia. Same as above, but faster.
ldmia r0, {r4-r5}

...
ballx:
.word 10
bally:
.word 10

This is actually how C structs are implemented as well. There is a label to some part of memory, followed by the different members. To load a member, you get the base pointer/address and use relative addressing to get the member you want, whether it's a byte, word, or anything.
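A C sketch of the same thing, assuming the four ball variables are laid out as four adjacent 32-bit words. The struct and function names are hypothetical. The offsets in the comments are what you would expect on ARM; the C standard does not strictly guarantee the absence of padding, but for four int32_t members it holds on the usual compilers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout matching four adjacent .word variables. */
struct Ball {
    int32_t x;    /* offset 0  -> ldr r4, [r0]      */
    int32_t y;    /* offset 4  -> ldr r5, [r0, #4]  */
    int32_t dx;   /* offset 8  -> ldr r6, [r0, #8]  */
    int32_t dy;   /* offset 12 -> ldr r7, [r0, #12] */
};

/* Read a member the way the assembly does: base address plus byte offset. */
int32_t load_word_at(const void *base, size_t byte_offset)
{
    int32_t value;
    memcpy(&value, (const unsigned char *)base + byte_offset, sizeof value);
    return value;
}
```

With a struct Ball b, load_word_at(&b, offsetof(struct Ball, dx)) returns the same value as b.dx. The compiler does this base-plus-offset calculation for you whenever you write b.dx.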

And then there's rendering of the ball itself. Your rendering loop is twice the size that it should be, and there should be no multiplications or memory loads inside it. If you have data that doesn't actually change inside the loops (i.e., it's "loop invariant"), take it out of the loop! Where possible, also replace multiplies by incremental additions. This is exactly what Miked0801's code did, so use that as an example. You will have to change "strh r1,[r0]!" to "strh r1,[r0], #2". Quick addressing/write-back overview:

Code:

@ Pre-indexed addressing
ldr Rd, [Rm, #4]     @ Rd = *(u32*)(Rm+4)

@ Pre-indexed, with writeback
ldr Rd, [Rm, #4]!    @ Rd = *(u32*)(Rm+4) ; Rm += 4

@ Post-indexed with writeback
ldr Rd, [Rm], #4     @ Rd = *(u32*)(Rm) ; Rm += 4

As you can see, updating pointers doesn't need a separate instruction in ARM; you can use write-backs. So "strh r3, [r0]; add r0, r0, #2" can be done in one instruction: "strh r3, [r0], #2".
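If it helps, the three addressing forms map onto familiar C pointer expressions: p[1], *++p and *p++. The sketch below uses made-up function names and a pointer-to-pointer so the write-back is visible to the caller.

```c
#include <stdint.h>

/* ldr Rd, [Rm, #4]   : offset only, pointer unchanged (p[1]) */
uint32_t load_offset(const uint32_t *p)
{
    return p[1];
}

/* ldr Rd, [Rm, #4]!  : advance the pointer first, then load (*++p) */
uint32_t load_pre_writeback(const uint32_t **pp)
{
    *pp += 1;
    return **pp;
}

/* ldr Rd, [Rm], #4   : load first, then advance the pointer (*p++) */
uint32_t load_post_writeback(const uint32_t **pp)
{
    uint32_t value = **pp;
    *pp += 1;
    return value;
}
```

The post-indexed form is the one you want for a fill loop: store at the current address, then move on to the next pixel.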

The code below is pretty much what it ought to be, including stacking for used registers so that it can be safely called from C code. There are some sneaky bits, but I hope most of it is understandable. Note that there is a pre-processor switch for vsynching with interrupts, because I'm not sure you have interrupts enabled right now. Also, the collision detection is still flawed. Correcting it has been left as an exercise for the reader.

Code:

.text
.arm
.align
.global bouncer

bouncer:
@ Save clobbered registers (r0-r3,ip are free)
stmfd sp!, {r4-r7, lr}

mov r0, #0x4000000 @ Set up video control register
mov r1, #0x400
add r1, r1, #3
str r1, [r0]

@ --- This is the main loop ---
.LmainLoop:

@ --- Register list ---
@ r0 : sourcy stuff
@ r1 : desty stuff
@ r2 : countery stuff
@ r3 : datay stuff / temp variable
@ r4 : ball.x
@ r5 : ball.y
@ r6 : ball.w; later ball.dx
@ r7 : ball.h; later ball.dy
@ ip : &ball

ldr ip,=ball @ &ball: label for x, y, dx, dy as four adjacent words (not defined in this listing)
ldmia ip!, {r4-r5} @ ball.x, ball.y

mov r6, #5 @ ball.w
mov r7, #5 @ ball.h

@ --- Rendering the ball ---

@# NOTE: do as much outside of the loops as possible
@# The following instructions are preparation steps that
@# represent 'loop invariant' quantities: things that
@# don't change inside the loop.
@#
@# * Get a pointer to a corner of the destination
@# (in this case the top-left).
@# * Instead of multiplying y*pitch, use incremental
@# offsets : dst += pitch
@# * That said, as the inner loop increments the dst pointer
@# the distance to the next scanline is ball.w less
@# than the actual pitch. This can be calculated outside the
@# loop as well.

ldr r0,=0x659E @ Pinkish color

@ u16 *dst = &vram_16[y*240+x] = VRAM + y*(512-32) + x*2
ldr r1,=0x6000000
add r1, r1, r5, lsl #9
sub r1, r1, r5, lsl #5
add r1, r1, r4, lsl #1

rsb r3, r6, #240 @ screen.w - ball.w

.LloopY:
mov r2, r6 @ Reload ball.w
.LloopX:
strh r0, [r1], #2
subs r2, r2, #1
bne .LloopX

add r1, r1, r3, lsl #1
subs r7, r7, #1
bne .LloopY

@ --- Ball movement and collision detection (faulty) ---

ldmia ip!, {r6-r7} @ Load ball.dx, ball.dy

@ x += dx; if(x<0 || x>=240) dx= -dx;
adds r4, r4, r6
rsblt r6, r6, #0
cmp r4, #240
rsbge r6, r6, #0

@ y += dy; if(y<0 || y>=160) dy= -dy;
adds r5, r5, r7
rsblt r7, r7, #0
cmp r5, #160
rsbge r7, r7, #0

stmdb ip!, {r4-r7} @ Save x, y, dx, dy

@ --- VBlank synchronization ---
@# NOTE : you always need a timing signal for consistent
@# framerates. This is usually the VBlank

#if (I_CAN_HAS_INTERRUPTS==1)

@ VBlankIntrWait. Only use if interrupts work properly
swi #0x50000

#else

@ V-synch via REG_VCOUNT
mov r0, #0x04000000
.LwaitForVDraw:
ldrh r3, [r0, #6]
cmp r3, #160
bge .LwaitForVDraw

.LwaitForVBlank:
ldrh r3, [r0, #6]
cmp r3, #160
blt .LwaitForVBlank

#endif

@ --- Clear screen with CpuFastSet ---
@# CpuFastSet is the fastest way to fill large chunks of
@# memory, but comes with requirements. Look it up in GBATek.

mov r0, #0
stmfd sp!, {r0} @ Reserve a place for zero on the stack.
mov r0, sp @ Source-ptr
mov r1, #0x06000000 @ Destination
ldr r2,=0x01004B00 @ Count + fill-bit: 240*160/2 + BIT(24)
swi #0xC0000 @ Clear screen
add sp, sp, #4 @ 'pop' zero off again

b .LmainLoop

ldmfd sp!, {r4-r7, lr} @ Restore used registers
bx lr
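One of the sneaky bits above is the destination pointer calculation, which replaces the multiply by the pitch with two shifts: y*480 = y*512 - y*32. A quick C sketch of that identity (the function name is made up for illustration):

```c
#include <stdint.h>

/* Byte offset of pixel (x, y) in mode 3 using only shifts and adds:
   y*480 = (y << 9) - (y << 5), and x*2 = x << 1. */
uint32_t m3_offset_shifts(uint32_t x, uint32_t y)
{
    return (y << 9) - (y << 5) + (x << 1);
}
```

This matches y*480 + x*2 for every pixel on the screen, so the three add/sub instructions with shifted operands do the same job as a mul followed by two adds.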

C driver if necessary:

Code:

void bouncer();

int main()
{
//# Set-up interrupts and whatever

bouncer();
return 0;
}

Many of the optimizations I've used here could already be done in C. Coding in assembly should only be done when you can write code that's faster than what the compiler can create, but you can only do that effectively if you already know some of the basic techniques. It may help to learn more about common optimization techniques before writing your own asm routines. The various loop optimizations are especially useful here.

Also look at the compiler-generated assembly. It already knows many clever tricks that you might not, and has the benefit of writing functionally correct code (if it didn't, people would get very cranky). Having said that, it does still do a few silly things, most notably in loops. These things tend to be obvious, though.

ASM > trying to draw a square

#153284 - yaazz - Thu Mar 27, 2008 6:22 pm

#153289 - Cearn - Thu Mar 27, 2008 8:06 pm

#153293 - Miked0801 - Thu Mar 27, 2008 10:33 pm

#153324 - yaazz - Fri Mar 28, 2008 3:40 pm

#153326 - tepples - Fri Mar 28, 2008 4:28 pm

#153411 - yaazz - Sat Mar 29, 2008 9:31 pm

#153412 - tepples - Sat Mar 29, 2008 9:34 pm

#153415 - Dwedit - Sat Mar 29, 2008 10:58 pm

#153451 - yaazz - Sun Mar 30, 2008 10:37 pm

#153479 - Cearn - Mon Mar 31, 2008 5:11 pm

#153668 - yaazz - Thu Apr 03, 2008 5:59 pm

#153671 - Cearn - Thu Apr 03, 2008 6:44 pm

#153681 - yaazz - Thu Apr 03, 2008 10:00 pm

#153694 - Dwedit - Fri Apr 04, 2008 12:35 am

#153727 - yaazz - Fri Apr 04, 2008 4:10 pm

#153733 - Miked0801 - Fri Apr 04, 2008 7:09 pm

#153864 - Cearn - Mon Apr 07, 2008 3:13 pm