gbadev.org forum archive

This is a read-only mirror of the content originally found on forum.gbadev.org (now offline), salvaged from Wayback machine copies. A new forum can be found here.

ASM > Can this be optimized better?

#72268 - aaronphughes - Fri Feb 17, 2006 8:04 am

I have the following code that calculates an IP checksum, as in the checksum found in an IP header. I have included the code. Can it be optimzed any better?

Thanks,
AH


b MainFunction


UnitTest1:
@DCW 0x0100, 0xF203, 0xF4F5, 0xF6F7
@DCW 0x0000, 0x0000, 0x0000, 0x0000 ;; ChkSum = 0x210E

UnitTest2:
@DCW 0x45C0, 0x0044, 0xad22, 0x0159
@DCW 0xAC14, 0x7143, 0xE000, 0x0005 ;; ChkSum = 0x0E22

UnitTest3:
@DCW 0x45C0, 0x0044, 0xDC15, 0x0159
@DCW 0xAC14, 0x7142, 0xE000, 0x0005 ;; ChkSum = 0xDF2F

MainFunction:



ldr R1,=UnitTest1 ;; This is the ADDRESS of my UnitTest Data
ldr R2,=8 ;; Size of the array of 16 bit words

mov R3,#0 ;; clear this guy out
ldr R4,=0xFFFF ;; MASK for lower 16 bits

lbl_AddTheData1:

ldr R5,[R1]+2!, ;; get value @ R1, put into R4, inc R1 by 2 - moves to next word
and R5, R5, R4 ;; only want lower 16 bits
add R3, R3, R5 ;; sum these values

subs R2, R2, 1 ;; dec by 1
bne lbl_AddTheData1 ;; Repeat until we hit the end of the array

lbl_AdditionComplete:

;;mov R2, R3, LSR #16 ;; The upper 16-bits are the carries
;;add R3, R3, R2 ;; Add the carries to the sum in R3
add R3, R3, R3 LSR #16 ;; The above two lines in one command

EOR R3, R3, R4 ;; NOT R3
and R3, R3, R4 ;; Only want the lower 16-bits

EndOfProgram:

B EndOfProgram
_________________
urpNO_SPAM_MAN@canerdian.ca
http://www.canerdian.ca

#72277 - FluBBa - Fri Feb 17, 2006 9:18 am

use LDRH for reading 16bit instead of 32bit.
_________________
I probably suck, my not is a programmer.

#72821 - strager - Tue Feb 21, 2006 11:57 pm

Here is what I have come up with. I tried it with the first data set (on paper), and it yeilds the correct results.
Code:

MainFunction:       ; Optimized for speed and size
; User init.
ldr r1, =UnitTest1  ; Load our sample data
ldr r2, =8          ; The array size is 8

; Program init.
mov r3, #0          ; Reset our sub counter
mov r2, r2, lsl #2  ; Multiply the array size by two (get bytesize)

; Sum loop
loop:
subs r2, r2, #2     ; Go back two bytes
ldrh r5, [r1, r2]   ; Load the data
sub r3, r3, r5      ; Subtract ("add") our data
bne loop            ; Loop if not the first byte

; r3 now contains the negative sum of the data

add r3, r3, r3, lsl #16 ; Add the upper and lower half-words, storing it in the upper half-word
mov r3, r3, lsr #16     ; Shift the result back into place.

My major itching was that your code used masks. I didn't like this, so I used some fancy trickery to let the last two instructions do the masking for me. Also, this bit of code assumes that you are optimizing for both speed and size; if you want more speed, you can use 32-bit reads instead.

Edit: Fixed some typos.


Last edited by strager on Wed Feb 22, 2006 10:56 pm; edited 1 time in total

#72959 - aaronphughes - Wed Feb 22, 2006 5:14 pm

Hi and thanks for the suggestions. Strager, in your code, I cannot see where the actual data (in R5) is being used to generate the checksum? Also when I ran your code, using UnitTest3, I do not get the correct checksum.

Based on some experimenting and FluBBa's suggestion, I have came up with the following code.



MainFunction:

ldr R1,=UnitTest2 ;; 1st Argument, address of array of 16-bit values

ldr R2,=8 ;; 2nd Argument, length of array

mov R3,#0 ;; zero out - will hold the SUM of 16 bit array

ldr R4,=0xFFFF ;; MASK for lower 16 bits - used frequently



lbl_SummingTheArray:

ldrh R5,[R1]+2! ;; load the 16-bit value @ R1, inc R1 pointer by 2

add R3, R3, R5 ;; sum these values

subs R2, R2, 1 ;; dec by 1

bne lbl_SummingTheArray ;; Repeat until the end of the array

FinalStepIsAddingTheCarriesAndInverting:

;;mov R2, R3, LSR #16 ;; The upper 16-bits are the carries
;;add R3, R3, R2 ;; Add the carries to the sum in R3

add R3, R3, R3 LSR #16 ;; The above two lines in one command

;;EOR R3, R3, R4 ;; NOT R3
;;and R3, R3, R4 ;; Only want the lower 16-bits

;; The above two lines can be done with BIC
BIC R3, R4, R3 ;; NOT R3, then AND with R4, result in R3



EndOfProgram:
B EndOfProgram

#72977 - Cearn - Wed Feb 22, 2006 7:31 pm

aaronphughes wrote:
Hi and thanks for the suggestions. Strager, in your code, I cannot see where the actual data (in R5) is being used to generate the checksum? Also when I ran your code, using UnitTest3, I do not get the correct checksum.

strager's procedure is sound, only the 'r2' in the sub before the loop-branch should have been an 'r5'. Change that and it's fixed. This would be my own version, in GNU asm:
Code:
@ code tags are good for displaying code!

@ === u16 ipcrc(const void *data) CODE_IN_IWRAM; ======================
@ NOTE! GNU asm syntax.
@ reglist:  r0-r3: data pointer, value, counter and sum

   .section .iwram,"ax", %progbits
   .align   2
   .code   32
   .global   ipcrc
ipcrc:
   mov   r2, #8
   mov   r3, #0

   @ Summing data; using -x instead of +x already inverts the sum
.Lcrc_loop:
      ldrh   r1,[r0],#2
      sub    r3, r3, r1
      subs   r2, r2, #1
      bne .Lcrc_loop

   @ add sum (lower halfword) to the carries (upper hw)
   @ then shift sum+carries down and into the return register
   add   r3, r3, r3, lsl #16
   mov   r0, r3, lsr #16
   bx lr


strager wrote:
... if you want more speed, you can use 32-bit writes instead.

You meant 'reads', right? Not sure that'd help, though: yes, the loads are quicker, but simply adding words will screw up the carries (at least that's what happened when I tested it) and masking to compensate defeats the purpose of loading words in the first place.

Another optimization would be to unroll the loop, but that's cheating. :P

#73223 - strager - Thu Feb 23, 2006 11:41 pm

I've worked it out to be a little faster without loop unrolling. This one reads two halfwords at a time, then adds those two together with the counter. Then, it traps the overflow and adds it to the counter if necessary.
Code:
.arm
.global   ipsum

; u16 ipsum(u16 *data); - Finds the IPSUM of data
; Notes - If the halfword size is not even (x%2==1),
; the function hangs.  No way to get around this unless
; you want to waste an extra four bytes and a few cycles.
; Current function cycles: 5 + size / 2 * (5 or 6)
; Old function cycles:     5 + size * 4

ipsum:      ; Finds the sum of an IP address (?)
mov r2, #8  ; Halfword size
mov r3, #0  ; Counter

ipsum_loop:                 ; Main loop
ldr r1, [r0], #4            ; Load 2 halfwords
adds r1, r1, r1, lsl #16    ; Add them
sub r3, r3, r1, lsr #16     ; Sub from counter
subvs r3, r3, #0x00010000   ; Sub overflow
subs r2, r2, #2             ; Sub size
bne ipsum_loop              ; Loop if size > 0

add r3, r3, r3, lsl #16 ; Add the two halfwords
mov r0, r3, lsr #16     ; Shift back

bx lr   ; Return


I tried working it out on paper, and I got the same results (with only four iterations). Just be aware that the data must be 4-byte (perhaps 2-byte?) aligned, or the code may trip on itself. It only takes up 8 more bytes of space, so if size isn't what you are optimizing for, this may as well be the best you can do without loop unrolling.

#73611 - FluBBa - Sun Feb 26, 2006 10:16 pm

Should the extra sub be on carry set? Or am I missing something?
_________________
I probably suck, my not is a programmer.

#73620 - strager - Sun Feb 26, 2006 11:42 pm

FluBBa wrote:
Should the extra sub be on carry set? Or am I missing something?

I'm not so sure of that. I assumed that "carry" meant that some bits of the number were modified because the bits left one on both sides were both one. E.g., 01 + 01 = 10, because both bit zeros are 1. Overflow means that the operations caused the number to go past the 31'nd bit, I think. Perhaps you were thinking of the logical operators, where when a bit is shifted off the register, the carry flag is set?

Maybe I am way off; I don't know, I haven't tried it.

#74324 - FluBBa - Sat Mar 04, 2006 12:41 pm

Overflow means that a number goes from it's max signed value over to it's lowest signed value (or the other way around as well).

like this: 0x7F800000+0x00821220=0x80021220
Overflow will be set, the value goes from a positive to a negative by adding 2 positive values.
I don't have any good site which explains it better byt I suppose Google could help you.

Carry is used when the value wraps from its max unsigned value.
_________________
I probably suck, my not is a programmer.