#161657 - SDy - Sat Aug 09, 2008 1:38 pm
Greetings everyone,
After attempting to improve the interrupt handling in libnds (version 20071023), I end up asking myself several questions :
Which one of the CPU is "supposed" to be handling the interrupts ? (i.e. on which one of them does the interrupt handler runs)
I'm not sure where the irqTable is located since it is declared with a "section(.itcm)" attribute but only the ARM9 has such section.
How many cycles does it require for the ARM7 to fetch a 32-bit word from its IWRAM ?
The best source of info I could find is
http://nocash.emubase.de/gbatek.htm#dsmemorytimings
but the column marked as "BUS" in those tables is not described, is it simply the bus width for transferring stuff to/from those memories ? Also the IWRAM timings are not given :s
Any help would be appreciated :)
S.
Last edited by SDy on Sat Aug 09, 2008 1:53 pm; edited 1 time in total
#161658 - Dwedit - Sat Aug 09, 2008 1:50 pm
Both CPUs have their own interrupt handler. What are you trying to get at?
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."
#161660 - SDy - Sat Aug 09, 2008 2:08 pm
By looking at the code for the interrupt handler as it is, the fact of going through the irqTable, comparing the mask then fetching the handler address when it's the right one seems not extremely efficient to me (as the size of the table increases). I'd imagine that an interrupt handler should grab the handler's address in a constant time, not based on its position within the table.
Also, the number of memory accesses should be kept as low as possible (because they're usually *very* slow).
I was thrilled to see the clz instruction which would allow me to compute the offset within the table in 2 instructions, but it's only available on v5 and above ARM.
The main problem I'm bumping into is how to compute the index of a set bit in an interrupt mask. The best solution I found requires around 10 arithmetic instructions. So what I'm trying to figure out is, which one is the fastest ?
S.
#161661 - tepples - Sat Aug 09, 2008 4:30 pm
When I used to write my own ISRs, I would unroll the dispatching loop by hand, checking for each interrupt that I was interested in. Usually this was only vblank, and the ISR was short.
_________________
-- Where is he?
-- Who?
-- You know, the human.
-- I think he moved to Tilwick.
#161666 - Maxxie - Sat Aug 09, 2008 7:45 pm
SDy wrote: |
By looking at the code for the interrupt handler as it is, the fact of going through the irqTable, comparing the mask then fetching the handler address when it's the right one seems not extremely efficient to me (as the size of the table increases). I'd imagine that an interrupt handler should grab the handler's address in a constant time, not based on its position within the table.
|
- It's limited by a constant time. 32 iterations through a shift (to get the next bit) and tst (to check the flag's state) instruction shouldn't be troubling you) the other effort for things like getting the function address keeps the same.
- Interrupt handling of the flags in REG_IF in constant times for all IRQs is impossible. There can be more then one flag be valid at the same time, in which you allways have to do one first, thus having a delay for the other.
- doing the lookup itself in constant time is possible through lookup tables, but not for the DS with it's 4MiB and 32bit IF. On 8bit machines it's could save a little time on the cost of memory.
Quote: |
Also, the number of memory accesses should be kept as low as possible (because they're usually *very* slow).
|
The iterating and testing method below needs 4 accesses (IE addr, IE&IF content, table) + 4 stack operations allways, and n (table index) accesses and n*8 stack operations(the call & recovering), with n as the number of enabled and set bits.
(edit: actually reduced stack operations on set bit below by 2)
(not checked but this should point the way)
Code: |
STMDB sp!,{r4,lr}
LDR r0,=REG_IE_ADDR
LDR r1,[r0]
LDR r0,[r0,#4]
AND r0,r0,r1
MOV r1,#0
MOV r2,#1
LDR r4,=Table
loop:
TST r0,r2 lsl r1
ADD r1,r1,#1
BNE check
CMP r0,#32
BEQ end
B loop
check:
LDR r3,[r4,r1 lsl #2]
TST r3,r3
BEQ loop
STMDB sp!,{r0-r1,r3}
BL r3
LDMIA sp!,{r0-r1,r3}
MOV r2,#1
B loop
end:
LDMIA sp!,{r4,lr}
|
Quote: |
I was thrilled to see the clz instruction which would allow me to compute the offset within the table in 2 instructions, but it's only available on v5 and above ARM.
|
And it would only give you the highest enabled bit, leaving you with the problem that you need to repeat the test after clearing this bit and thus not doing it in constant time.
_________________
Trying to bring more detail into understanding the wireless hardware
#161676 - eKid - Sun Aug 10, 2008 1:38 am
The libnds irq handler just stores a bitmask with each entry. If you only have 3 irqs enabled, the 'find irq' loop will only iterate up to three times. (also gives a simple priority system)
ARM7 can access iwram in 3 cycles.
#161680 - Cydrak - Sun Aug 10, 2008 6:49 am
Both processors handle their own interrupts (and by necessity IME, IE and IF are local to each CPU). Some devices and their IRQs stick to one side. Others (notably DMA, FIFO and timers), come in pairs.
He's listed IWRAM as "WRAM," and EWRAM as "Main RAM." I think bus width is right. Keep in mind, the table doesn't include the cycles related to LDR itself.
I'm no ARM guru, but I think the table is reasonable. It's already in fast memory and usually pretty tiny. Come to think, I bet you could even get that loop down to 4 ops, less if it's unrolled:
Code: |
; Untested, should be pretty obvious though..
orr r1, #(1<<31) ; r1 = pending IRQ mask
ldr r2, =irqTable ; (+sign bit to test end marker)
findIRQ:
ldr r0, [r2], #8
ands r0, r1
bgt callHandler ; > 0 if matched something
beq findIRQ ; = 0 if no match, but more entries
|
My gut feeling is 10 ALU ops should break even with 1-2 iterations here. But the mask table gives you something over shifting.. you can order them by priority and (in some cases?) service related IRQs in the same handler. In other words, although there might be "worse" latency, you have more control where it goes.
So, what are you trying to do with this? If it's important, I can only suggest to time it and see. Sometimes there is no "fastest" general purpose solution, in which case whatever you do will be either 1) a tradeoff (for better or worse), or 2) "good enough."
#161701 - Miked0801 - Mon Aug 11, 2008 5:01 pm
Optimizing the interrupt handler table isn't a bad idea to get some generic perforamance back. One problem though is you have to push every register you play with. That in and of itself adds overhead that is hard to overcome to a reasonable degree.
Anyways, if you wanted to trade a bunch of space for speed, you could skip the masking altogehter and just feed the if flag directly into a table of vectors and eitehr null terminate it for end check or store the number of vectors in the first position to check instead. No masking fun, just stepping through an ordered table of functions. Of course, a table 8-bits long is 256 entries of up to 8 functions. Straight state machine code.
#161703 - Dwedit - Mon Aug 11, 2008 6:56 pm
When I made my really tiny ISR for my GBA sleep hack patching program, I took advantage of the fact that r0 initially contains 0x04000000. I managed to squeeze in a check for the vblank or keyboard interrupt and a jump to the previous ISR if neither interrupt is set into 5 instructions (plus two words to hold a couple addresses).
Does the NDS also set r0 to 0x04000000 before calling the user ISR? If so, that's one less instruction you need.
_________________
"We are merely sprites that dance at the beck and call of our button pressing overlord."
#161722 - SDy - Tue Aug 12, 2008 12:24 pm
perhaps some code can help. My first idea was the following :
Code: |
ldr r0,=irqTable
ldr r1,[REG_IE]
ldr r2,[REG_IF]
ands r1,r1,r2 @ r1 = interrupts to handle
beq end
str r1,[REG_IF] @ acknowledge interrupts that will be handled
loop:
@ the two following instructions are the x &= ~(x-1) trick which allows to extract the lowest bit set
sub r3,r1,#1
bic r3,r1,r3
@ r3 contains the lowest interrupt bit which is set, so clz will give us the offset within irqTable where the handler to handle it is located
@ (assuming irqTable is ordered following : irqTable[0] = IRQ_HANDLER_BIT_24 .. irqTable[24] = IRQ_HANDLER_BIT_0)
clz r3,r3
sub r3,r3,#7 @ because there are only 25 interrupts we consider. If clz returns some number 'n', it should be between 7 and 31. So n-7 gives us [0:24]
ldr r2,[r0, r3]
stmfd sp!,{r0-r1,lr}
bx r2
ldmfd sp!,{r0-r1,lr}
@ the following two instructions are the x &= (x-1) trick which clears the lowest set bit in r1.
sub r3,r1,#1
ands r1,r1,r3
bne loop
end: |
Now, in this case, at each iteration, one interrupt will be handled, requiring one fetching from memory. Also, I cut the irqTable size in half since I don't need to store the irqMask for the lookup.
My problem here is that clz is not available on ARM7TDMI, so I have to compute myself the offset within irqTable where the handler is located.
The code I have obtained so far is based on the idea found here {check the // OR (IF YOU KNOW v IS A POWER OF 2) section of the code}
and is as follows :
(Assuming that r1 equals the mask of the interrupt for which we are computing the offset. Comments follow. I haven't tested this yet so a C code follows in case this one contains errors)
Code: |
mov r2,#0
lsrs r3,r1,#0x10
orrne r2,r2,#0x10
moveq r3,r1
lsrs r3,r3,#0x8
orrne r2,r2,#0x8
moveq r3,r1,lsr r2
lsrs r3,r3,#0x4
orrne r2,r2,#0x4
moveq r3,r1,lsr r2
lsrs r3,r3,#0x2
orrne r2,r2,#0x2
moveq r3,r1,lsr r2
orrne r2,r2,r3,lsr #0x1
|
Pseudo-code :
Code: |
r1 = REG_IE | REG_IF;
r2 = 0;
r3 = r1;
if (r3 >>= 0x10) {r2 |= 0x10;} else {r3 = r1;}
if (r3 >>= 0x8) {r2 |= 0x8;} else {r3 = r1 >> r2;}
if (r3 >>= 0x4) {r2 |= 0x4;} else {r3 = r1 >> r2;}
if (r3 >>= 0x2) {r2 |= 0x2;} else {r3 = r1 >> r2;}
if (r3 >>= 0x1) {r2 |= 0x1;}
|
The idea is to shift-right r1 by 16 positions and update the Z flag. If the set bit is in [31:16], the shift-right will not zero r1 (and Z=1). This can be used as the condition to whether the set bit is within [31:16] or [15:0].
This is basically a search in a binary tree where the bit position corresponds to the sequence of five boolean values obtained from asking at each point the question "Is it in the left-half of the word we're considering or the right-half ?" We then do it successively for the shift-right by 8, 4, 2 and 1.
r2 now contains the offset we can add to the irqTable address to find the address to which bx will branch.
Some replies :
Maxxie wrote: |
- Interrupt handling of the flags in REG_IF in constant times for all IRQs is impossible. There can be more then one flag be valid at the same time, in which you allways have to do one first, thus having a delay for the other. |
What I meant was to get the address of a handler in constant time. If we have 10 bit set in REG_IE, and the one for which the handler is at the end of the irqTable occurs, the loop will go through all of them (9 useless ldr's) before actually finding the right one. I know that servicing is linear in the number of bits set in REG_IE.
Cydrak wrote: |
So, what are you trying to do with this? If it's important, I can only suggest to time it and see. Sometimes there is no "fastest" general purpose solution, in which case whatever you do will be either 1) a tradeoff (for better or worse), or 2) "good enough." |
What I want is to optimize the way interrupts are serviced here. I realize that in some cases it might be more interesting to write your own interrupt handler (the main one), for instance when there is 1 or 2 bits set in REG_IE. Perhaps only some local improvements can be made (merging some instructions, restructuring the code, etc.).
How exactly would you time it ? Does no$gba or desmume offer timing facilities ?
EDIT: ok, no$gba does provide profiling capabilities but I'm not willing to pay 750$ (yet ^^)
Miked0801 wrote: |
Optimizing the interrupt handler table isn't a bad idea to get some generic perforamance back. One problem though is you have to push every register you play with. |
Actually, I read that the BIOS-level interrupt handler (which branches to the interrupt handler I'm modifying) stores r0-r3 before branching (scrolldown to BIOS Interrupt Handler) so I'm trying to exclusively use r0-r3 if possible.
Miked0801 wrote: |
Anyways, if you wanted to trade a bunch of space for speed, you could skip the masking altogehter and just feed the if flag directly into a table of vectors |
Precisely what I'm trying to do :)
Dwedit wrote: |
Does the NDS also set r0 to 0x04000000 before calling the user ISR? If so, that's one less instruction you need. |
Again, according to this, it doesn't seem to do it. Besides the location of irqTable is not hardware-hardcoded, only REG_{IE,IF,IME} are.
EDIT : my mistake; it does seem to place 0x4000000 in r0 before jumping to the main interrupt handler
Code: |
0000012C mov r0,4000000h ;ptr+4 to 03FFFFFC (mirror of 03007FFC)
00000130 add r14,r15,0h ;retadr for USER handler $+8=138h
00000134 ldr r15,[r0,-4h] ;jump to [03FFFFFC] USER handler |
S.
#161726 - Cydrak - Tue Aug 12, 2008 2:32 pm
SDy wrote: |
What I want is to optimize the way interrupts are serviced here. I realize that in some cases it might be more interesting to write your own interrupt handler (the main one), for instance when there is 1 or 2 bits set in REG_IE. Perhaps only some local improvements can be made (merging some instructions, restructuring the code, etc.). |
I was moreso curious what you need the fast ISR for, or if it's just academic. For me libnds' is good enough, and I guess I appreciate the straightforward code. If you do optimize, you probably want to test if it really *is* faster, in which case the conditions are relevant...
SDy wrote: |
How exactly would you time it ? Does no$gba or desmume offer timing facilities ? |
Buying NO$GBA aside (since it seems unavailable at the moment), a couple things come to mind. Obviously, if it's impacting the app's performance, that can be measured. Otherwise the hardware timers can give reasonable cycle counts.
You can read TIMER_DATA, subtracting the reload value to get the time since last rollover. In a timer handler this would give some indication of ISR latency. Another option would be to set the timer, arrange for an IRQ in the next few cycles, and measure (adding NOPs, if needed, until it shows up) the roundtrip time through the ISR.
SDy wrote: |
Actually, I read that the BIOS-level interrupt handler (which branches to the interrupt handler I'm modifying) stores r0-r3 before branching |
Yep. It saves r0-r3,r12,lr_irq. Notice though that the two BIOS are different. The ISR will see r0 = 0x04000000 (ARM7) or DTCM + 0x4000 (ARM9; normally 0x0b004000, IIRC).
#161729 - SDy - Tue Aug 12, 2008 4:52 pm
You could say it's more academic. I never got my hands on such low-level coding. I studied some assembly and know how hardware works, so I figured why not try toying with the libnds to put that knowledge to practice.
My objective is to get a full understanding of what happens deep down inside the library. Improving it in the process is a plus.
Thanks for the advices Cydrak. And thanks to the others of course ^^
S.