Tuesday, April 03, 2007

Using Microsoft Visual C++ Express to compile RVISS model

One of the great things about the RealView Instruction Set Simulator (RVISS/ARMulator) is that it supports creating device models. However, you need Microsoft Visual C++ to compile the models. Thankfully, since 2005, Microsoft has offered a free version as part of its Visual Studio Express Editions. (Of course, you eventually need to register to use it indefinitely and, if I am not mistaken, the registration requires you to run the dreaded "genuine system check".) It comes complete with a nice IDE. This is a far cry from what we got with the Microsoft Visual C++ Toolkit 2003 a long time back.

The easiest thing is to use an existing device model from the RVDS installation as a base and modify it from there. For example, let's assume you want to use the timer model as a base. You should be able to locate the file timer.c and the folder timer.b in RVARMulator\ExtensionKit\1.4.1\win-32-pentium\armulator in your RVDS installation folder. Make a copy of each as mytimer.c and mytimer.b, respectively. You should find the Makefile within the mytimer.b folder. In the Makefile, rename all occurrences of 'timer' to 'mytimer' and 'Timer' to 'MyTimer'. You could choose any other names you like instead of 'mytimer' and 'MyTimer', of course.

The best way to compile the device model is to use the "Visual Studio 2005 Command Prompt" from Start menu > All Programs > Visual C++ 2005 Express Edition > Visual Studio Tools. This custom command prompt sets up the appropriate environment variables to use MSVC++ Express effectively from the command line. If you prefer to configure your own environment variables, make sure you add <install>\VC\INCLUDE to INCLUDE, <install>\VC\LIB to LIB, and <install>\Common7\IDE;<install>\VC\BIN to Path, where <install> is where you have installed MSVC Express. In particular, if you forget to include <install>\Common7\IDE, you will get this when you issue nmake:

cl /c /Za /I..\.. /I..\..\..\armulif /I..\..\..\rdi
/I..\..\..\clx /D_CONSOLE /D_MBCS /DNLS /nologo
/W3 /GX /GR /WX -DRDI_VERSION=151 -DARM10MODEL
/O2 /G6 /MD /DNDEBUG
/DARM_RELEASE="\"RVARMulatorISS1.4.1\""
/DBUILD_NUMBER=290 /Iderived /Fomytimer.obj
..\..\mytimer.c
NMAKE : fatal error U1077: '...\cl.EXE' : return
code '0xc0000135'

And if you explicitly issue the cl command above, you will get an "mspdb80.dll not found" error. So make sure <install>\Common7\IDE is in your Path environment variable.

Once you have got the environment variables out of the way, you will still get a small error when you issue nmake, as shown below:

..\..\..\armulif\perip_sordi.h(184) : error C2220:
warning treated as error - no 'object' file
generated
..\..\..\armulif\perip_sordi.h(184) : warning C4996:
'sprintf': This function or variable may be unsafe.
Consider using sprintf_s instead. To disable
deprecation, use _CRT_SECURE_NO_WARNINGS. See
online help for details.
D:\msvcx\VC\INCLUDE\stdio.h(345) : see declaration
of 'sprintf'
NMAKE : fatal error U1077: '...\cl.EXE' : return
code '0x2'

ARM has chosen to treat every warning as an error with the /WX option. Noble, yes, but it is not future-proof. Apparently, with the latest Visual C++, sprintf and other functions that can clobber a buffer because they do not take the buffer length generate a warning. Removing the option from the Makefile does the trick.
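
The warning fires on any unbounded sprintf call, so another way to keep /WX happy is to make the call site length-aware. The sketch below is only an illustration: it uses C99 snprintf so that it builds with most toolchains (with Visual C++ 2005 the counterpart would be sprintf_s or _snprintf), and format_name is a hypothetical helper, not the code in perip_sordi.h.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: builds "<base><index>" without overrunning dst.
 * Returns 0 on success, -1 if the result did not fit in dst_size bytes. */
static int format_name(char *dst, size_t dst_size, const char *base, int index)
{
    /* snprintf never writes more than dst_size bytes, including the NUL,
     * and returns the length the full string would have had. */
    int n = snprintf(dst, dst_size, "%s%d", base, index);
    return (n >= 0 && (size_t)n < dst_size) ? 0 : -1;
}
```

The compiler message itself names the third route: defining _CRT_SECURE_NO_WARNINGS before the headers are included silences the deprecation without touching the code.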

Friday, January 12, 2007

Mac OS X Embedded?

The Apple iPhone has been revealed recently and it gets me all excited. The phone itself is nice but I am more interested in the OS behind it. It is none other than Mac OS X itself. What processor is it running on, by the way?

I'm wondering whether this "embedded" OS X would be available to us, embedded developers, to tinker with. Knowing Apple, however, I don't think this would happen. Maybe, it is a good thing. I haven't bought anything from Apple Inc. yet. I would love to have the MacBook with Intel Core Duo but my wife won't let me. Anyway, with Apple products, they tend to be very consistent. You know what to expect. So embedded Darwin OS, anyone?

Now, on the iPhone product itself, why only EDGE? Why not 3G? Also, why would it only be available in 2008 for us in Asia? We can't wait that long. And please support a 'virtual' 12-button keypad. I know I am a dinosaur but I am used to it and, from the look of it, the virtual QWERTY keyboard buttons are too small for my fingers.

Let the new age in personal mobility begin!

Thursday, January 11, 2007

RVDS ARMulator and GNUARM-compiled binaries

I have started to use the ARMulator, which is also called the RealView Instruction Set Simulator (RVISS). I was interested to see if binaries compiled using the GNUARM toolchain could work with the ARMulator. Specifically, I was hoping that input/output (from printf or fgets, for example) could be done from within the RealView Debugger (RVD) session itself. To my surprise, the GNUARM-compiled binaries would not show printf output when run using the RVD/ARMulator even though they worked just fine with GDB. I have customized my crt0.S, so that was the first place I looked. In addition, a simple program using printf and fgets, compiled using GNUARM without my custom crt0.S, seemed to work. Lo and behold, after comparing my crt0.S with the default one within the GNUARM distribution, I found that I had failed to call initialise_monitor_handles in my crt0.S. Calling it after stack initialization and BSS clearing fixed the problem.

(Apparently, the problem above only appears in the GNUARM toolchain from gnuarm.org. The toolchain from CodeSourcery does not exhibit this problem. In fact, the latter does not even have the initialise_monitor_handles function. Oh, well.)

Tuesday, January 09, 2007

RealView Development Suite and Eclipse

I am starting to use RVDS from ARM. I'm using it mainly for its debugger (RVD) and its simulator (RVISS). I don't know if I will ever use the CodeWarrior IDE that comes with it. However, since ARM supports a number of Eclipse plugins for RVDS, and I have been using Eclipse for most of my embedded work, I thought I would give the plugins a try.

Download the Eclipse plugins and documentation from the ARM website. Even though the RVDS Eclipse plugins page mentions only Eclipse 3.1 and CDT 3.0.0, they work just fine with the Eclipse 3.2 and CDT 3.1.0 that I am using.

To install the plugins, simply follow the instructions in the RVDS Eclipse Plugins User Guide. Once installed, you can start creating your first RVDS "Hello World!" program in Eclipse. Alternatively, follow the User Guide to create an ARM or ARM/Thumb interworking project.

The User Guide also mentions that GNU Make is required and we should be able to use either the one from Cygwin or from MinGW. I find, however, that if you use Cygwin Make, the first build of a project will succeed but subsequent builds will fail with this error:

hello-rvds.d:6: *** target pattern contains no `%'. Stop.

Or something to that effect. It has something to do with the Cygwin Make program's inability to understand a full Windows pathname, which includes the drive letter followed by a colon. So your only option is to download GNU Make from MinGW. It comes as an installer. Once installed, put the path to the Make program in your PATH environment variable. Since the program is named mingw32-make.exe, which is different from the Cygwin one, you do not have to worry about a name clash.

One last detail: you also need to replace the default make with mingw32-make in your project properties > C/C++ Build > Build Settings.

Saturday, September 02, 2006

A case of premature optimization

I had a case of premature optimization recently. I need to learn more self-control: to take it easy and let the compiler do what it does best.

I had a few tests I needed to write and they had to be super tight because they would run in gate level simulation (GLS), where time is a luxury the verification team does not have.

My other, more normal test programs make use of the newlib library, and I have implemented a system clock and a console to make writing test programs more palatable. However, adding newlib makes test programs big and slow to complete in a simulation environment. So, for the GLS, I decided to make the test programs as lean as possible. This means not using newlib, or any other library for that matter. Which also means no standard C library and no math emulation (because the ARM7 does not have a math coprocessor and does not have a divide instruction).

As an alternative to using printf and friends, I wrote a function to write to a GPIO port instead:

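#include <stdarg.h>

/* print s, then each further string argument, up to a terminating null pointer */
static void lean_vout(char *s, ...) {
    va_list ap;
    va_start(ap, s);
    lean_out(s);
    while ( (s=va_arg(ap,char*)) != 0 )
        lean_out(s);
    va_end(ap);
}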

Where lean_out prints each character in the string to the GPIO port. Printing to GPIO is fast and the verification team can easily see the characters in the simulation signal viewer.
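
To make the calling convention concrete, here is a minimal sketch of what lean_out and a call to lean_vout might look like. The capture buffer standing in for the GPIO register, gpio_write, and the body of lean_out are all assumptions made so the sketch runs on a host; only lean_vout follows the code above (with const added to the first parameter).

```c
#include <stdarg.h>
#include <stddef.h>

/* Stand-in for the GPIO data register so the sketch runs on a host.
 * On the target this would be a volatile pointer to the real port
 * address (an assumption; the address is not given in this post). */
static char gpio_capture[64];
static size_t gpio_pos = 0;

static void gpio_write(char c)
{
    if (gpio_pos < sizeof gpio_capture - 1)
        gpio_capture[gpio_pos++] = c;
}

/* Assumed shape of lean_out: push each character to the port. */
static void lean_out(const char *s)
{
    while (*s)
        gpio_write(*s++);
}

/* Same logic as lean_vout above: print s, then every further string
 * argument until a null pointer is reached. */
static void lean_vout(const char *s, ...)
{
    va_list ap;
    va_start(ap, s);
    lean_out(s);
    while ((s = va_arg(ap, char *)) != NULL)
        lean_out(s);
    va_end(ap);
}
```

A call would look like lean_vout("test ", "42", " ok", (char *)0);. The terminator has to be a real null pointer, hence the cast: a bare 0 is passed as an int in a variadic call, which is not what va_arg(ap, char *) expects.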

Besides the ability to see what is happening in a test program, the ability to do assertions is important too. I cannot use the facility in <assert.h> because it makes use of printf. So, I have this macro for my lean-and-mean assertion:

#define lean_assert(cond) do { if (!(cond)) lean_do_assert(__FILE__, __LINE__); } while (0)

Basically, when the assertion fails, the filename and the line number scuttle through the GPIO port. But there is a small problem: I need to convert the line number into its ASCII representation. Because I recently had a program fail to build when I used division with -nostdlib, I thought I needed an algorithm to divide by 10 without actually using the division operator. And find one I did: Chapter 10 of the "Hacker's Delight" book, which talks about division by constants in great depth. Based on the information there, I came up with my version of itoa for converting an integer to its base-10 ASCII representation:

/* taken straight from "Hacker's Delight" */

static int remu10(unsigned n) {
    static char table[16] =
        {0, 1, 2, 2, 3, 3, 4, 5,
         5, 6, 7, 7, 8, 8, 9, 0};
    n = (0x19999999*n + (n >> 1) + (n >> 3)) >> 28;
    return table[n];
}

char *lean_itoa(int n, char *s, int len) {
    int r;
    s += len;
    *--s = '\0';
    while (n && len > 0) {
        r = remu10(n);
        n = ((n - r) >> 1)*0xCCCCCCCD; /* also from "Hacker's Delight" */
        *--s = '0' + r;
        --len;
    }
    return s;
}

Basically, the remainder is computed and the divide-by-10 is done using the remainder. The generated code from arm-elf-gcc (with optimization -O2):

0000820c <lean_itoa>:
820c: e0821001 add r1, r2, r1
8210: e3a03000 mov r3, #0 ; 0x0
8214: e3500000 cmp r0, #0 ; 0x0
8218: 13520000 cmpne r2, #0 ; 0x0
821c: e92d4010 stmdb sp!, {r4, lr}
8220: e1a0c002 mov ip, r2
8224: e5413001 strb r3, [r1, #-1]
8228: e2411001 sub r1, r1, #1 ; 0x1
822c: da000018 ble 8294 <lean_itoa+0x88>
8230: e59f4064 ldr r4, [pc, #100] ; 829c <.text+0x27c>
8234: e1a0e001 mov lr, r1
8238: e0803080 add r3, r0, r0, lsl #1
823c: e0803103 add r3, r0, r3, lsl #2
8240: e0633303 rsb r3, r3, r3, lsl #6
8244: e0803103 add r3, r0, r3, lsl #2
8248: e0633703 rsb r3, r3, r3, lsl #14
824c: e0803183 add r3, r0, r3, lsl #3
8250: e08330a0 add r3, r3, r0, lsr #1
8254: e08331a0 add r3, r3, r0, lsr #3
8258: e7d41e23 ldrb r1, [r4, r3, lsr #28]
825c: e0612000 rsb r2, r1, r0
8260: e1a020c2 mov r2, r2, asr #1
8264: e0823082 add r3, r2, r2, lsl #1
8268: e0833203 add r3, r3, r3, lsl #4
826c: e0833403 add r3, r3, r3, lsl #8
8270: e0833803 add r3, r3, r3, lsl #16
8274: e0820103 add r0, r2, r3, lsl #2
8278: e24cc001 sub ip, ip, #1 ; 0x1
827c: e2811030 add r1, r1, #48 ; 0x30
8280: e3500000 cmp r0, #0 ; 0x0
8284: 135c0000 cmpne ip, #0 ; 0x0
8288: e56e1001 strb r1, [lr, #-1]!
828c: caffffe9 bgt 8238 <lean_itoa+0x2c>
8290: e1a0100e mov r1, lr
8294: e1a00001 mov r0, r1
8298: e8bd8010 ldmia sp!, {r4, pc}
829c: 0001085b andeq r0, r1, fp, asr r8

From the look of it, the compiler has tried very hard to use additions and shifts to perform the constant multiplications. This led me to think that, maybe, the compiler would be clever enough to avoid doing a division when dividing by a constant (of course it is, but it didn't occur to my thick head). I tried this instead:

char *lean_itoa(int n, char *s, int len) {
    int r;
    s += len;
    *--s = '\0';
    while (n && len > 0) {
        r = n % 10;
        n = n / 10;
        *--s = '0' + r;
        --len;
    }
    return s;
}

And the generated assembly code:

0000820c <lean_itoa>:
820c: e0821001 add r1, r2, r1
8210: e3a03000 mov r3, #0 ; 0x0
8214: e3500000 cmp r0, #0 ; 0x0
8218: 13520000 cmpne r2, #0 ; 0x0
821c: e92d4010 stmdb sp!, {r4, lr}
8220: e5413001 strb r3, [r1, #-1]
8224: e1a0e002 mov lr, r2
8228: e2411001 sub r1, r1, #1 ; 0x1
822c: da00000f ble 8270 <lean_itoa+0x64>
8230: e59f4040 ldr r4, [pc, #64] ; 8278 <.text+0x258>
8234: e1a0c001 mov ip, r1
8238: e0c13094 smull r3, r1, r4, r0
823c: e1a03fc0 mov r3, r0, asr #31
8240: e0633141 rsb r3, r3, r1, asr #2
8244: e1a02003 mov r2, r3
8248: e0833103 add r3, r3, r3, lsl #2
824c: e0403083 sub r3, r0, r3, lsl #1
8250: e24ee001 sub lr, lr, #1 ; 0x1
8254: e2833030 add r3, r3, #48 ; 0x30
8258: e3520000 cmp r2, #0 ; 0x0
825c: 135e0000 cmpne lr, #0 ; 0x0
8260: e1a00002 mov r0, r2
8264: e56c3001 strb r3, [ip, #-1]!
8268: cafffff2 bgt 8238 <lean_itoa+0x2c>
826c: e1a0100c mov r1, ip
8270: e1a00001 mov r0, r1
8274: e8bd8010 ldmia sp!, {r4, pc}
8278: 66666667 strvsbt r6, [r6], -r7, ror #12

So, yes! The compiler avoids calling the division function. The previous generated code has 36 instructions with one load instruction for the table. The second one has 27 instructions with one load instruction used in the "signed multiply long" instruction. Which one is faster? I presume the compiler concluded that using the long multiplication is faster than additions-and-shifts (See also Re: Optimization question). But surely the second code is more compact.

This led me into digging further on how long divide-by-constant optimization has been around in GCC. Apparently, since 1994 when Torbjorn Granlund and Peter L. Montgomery wrote about it in their paper, "Division by Invariant Integers using Multiplication."

So there you have it. A case of trying too hard. I should've let GCC do its job. But I have learned a lot from the whole process.
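
For anyone who wants to convince themselves that the two "Hacker's Delight" tricks really agree with % and /, a host-side check is cheap. The sketch below is an assumption-laden rewrite rather than the code used in the test programs: it uses uint32_t so the 32-bit wrap-around is explicit, covers unsigned values only, and divu10 and check_div10 are names made up for this example.

```c
#include <stdint.h>

/* Remainder of n / 10 without a divide, as in remu10 above, written
 * with uint32_t so the arithmetic wraps at 32 bits on any host. */
static uint32_t remu10(uint32_t n)
{
    static const unsigned char table[16] =
        {0, 1, 2, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 8, 9, 0};
    n = (0x19999999u * n + (n >> 1) + (n >> 3)) >> 28;
    return table[n];
}

/* Exact division: n - r is a multiple of 10, so halve it and multiply
 * by 0xCCCCCCCD, the multiplicative inverse of 5 modulo 2^32. */
static uint32_t divu10(uint32_t n)
{
    return ((n - remu10(n)) >> 1) * 0xCCCCCCCDu;
}

/* Returns 1 if both tricks agree with % and / for every value probed.
 * The stride is odd and not a multiple of 5, so every last digit and
 * the whole 32-bit range get sampled while the run stays short. */
static int check_div10(void)
{
    uint32_t n = 0;
    do {
        if (remu10(n) != n % 10u || divu10(n) != n / 10u)
            return 0;
        n += 9973u;
    } while (n >= 9973u);   /* n drops below the stride once it wraps */
    return 1;
}
```

The multiply in divu10 works because 5 * 0xCCCCCCCD is 0x400000001, which is 1 modulo 2^32; multiplying an exact multiple of 5 by that constant therefore yields the quotient.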

Monday, August 14, 2006

Recursive Make Considered Harmful -- Building Multiple Programs

I have a number of test programs in a testsuite for an embedded system. They may share source files among them in addition to some other common files. I do not feel like managing a number of full-blown makefiles to build these programs. Besides, Peter Miller has written an excellent paper, "Recursive Make Considered Harmful". He argued that multiple makefiles with common dependencies simply do not work.

The paper above shows how to handle building a program with dependencies that come from multiple directories. How would we then build multiple programs using a single makefile without too much boilerplate cut-and-paste? The latest GNU make 3.81, with (correct) eval support, makes this very easy. Just like in the paper, I depend on GCC to generate source code dependencies and on sed, the stream editor, to transform the dependency output into what I want. But first, a word on how my directories are laid out.

The build directory contains the makefile, and the other directories, which sit beside it, are "module" directories.

Similar to the paper, I put a makefile include file, module.mk, in each module directory. The source files that other modules may depend on are exported through the SRC variable. In my case, however, a module may also have one or more source files that contribute to the implementation of the main function for the module (let's call such a module a program module). For this purpose, I define the PROG_SRC variable for declaring these source files. An example, the module.mk from my dhrystone module, is given below.
SRC +=
PROG_SRC := dhrystone/dhry_1.c dhrystone/dhry_2.c

Below is another example of module.mk, from my common module, which does not have a main function.
SRC += common/vec.S common/swi.S common/crt0.S \
common/int-handler.c common/swi-handler.c \
common/intr-write.c common/swi-write.c \
common/swi-clock.c common/console.c

In my single makefile, I have the variable PROGS to declare all program modules and COMMS for the other "common" modules. For example:
PROGS := simple unmapped rap fiq dhrystone large
COMMS := common

To declare the modules a program module depends on, I declare a makefile variable with a name like <program-module>_MK. Basically, this variable lists the module.mk files from the modules the program module depends on. An example is given below for my simple program module:
simple_MK := ../common/module.mk ../simple/module.mk

The above simply says that the simple module depends on the common module and its own. Note that the "self" module.mk must be last in the list so that its PROG_SRC variable will take effect.

Given that we have defined all of the program modules and the module.mk files each depends on, how do we go about creating the rules for building the program modules? Copy-and-paste is one option, but where is the fun in that? What we need is a rule-building template which can be "instantiated" through eval magic. The template is given below:
define PROG_template
SRC :=
include $$($(1)_MK)
$(1)_OBJ := $$(call get_objs,$$(patsubst %, ../%,$$(PROG_SRC) $$(SRC)))

$(1): $$($(1)_OBJ)
	$$(CC) $$(LDFLAGS) -o $$@ $$^
endef

So if you were to call PROG_template with argument simple, make would evaluate text equivalent to this:
SRC :=
include $(simple_MK)
simple_OBJ := ../simple/simple.o ../common/vec.o # ...and other common files

simple: $(simple_OBJ)
	$(CC) $(LDFLAGS) -o $@ $^

All module.mk files simple depends on are first included, thus filling in the SRC and PROG_SRC variables. Next, the object files it depends on are listed followed by the rule to build simple. By the way, function get_objs called above converts all .c and .S filenames into .o filenames:
define get_objs
$(patsubst %.S,%.o, $(filter %.S,$(1))) $(patsubst %.c,%.o, $(filter %.c,$(1)))
endef

Now, the trick is to "evaluate" the emitted text for all of the program modules:
$(foreach t,$(PROGS),$(eval $(call PROG_template,$(t))))

And finally, one rule to build all program modules:

progs: $(PROGS)

That is all there is to it. Below is the complete makefile for your reference.
.SUFFIXES:
.SUFFIXES: .h .c .S .o .lst .sym .d

# list the test programs to build. There should be a directory for each
# test program with the same name. COMMS are for directories that contain
# common source files but do not contain a main function.

PROGS := simple unmapped rap fiq dhrystone large
COMMS := common

# define the module.mk files to include for building each test. Each
# module.mk defines the source files it exports in variable SRC and,
# if the module defines a program, the program source file(s) in
# PROG_SRC. Put the module.mk for the program last.

simple_MK := ../common/module.mk ../simple/module.mk
unmapped_MK := ../common/module.mk ../unmapped/module.mk
rap_MK := ../common/module.mk ../rap/module.mk
fiq_MK := ../common/module.mk ../fiq/module.mk
dhrystone_MK := ../common/module.mk ../dhrystone/module.mk
large_MK := ../common/module.mk ../large/module.mk

# !!!DO NOT CHANGE ANYTHING BELOW THIS LINE!!!

TARGET:=arm-elf
CC:=$(TARGET)-gcc
AS:=$(TARGET)-as
LD:=$(TARGET)-ld
OBJDUMP:=$(TARGET)-objdump
NM:=$(TARGET)-nm
OBJCOPY:=$(TARGET)-objcopy
STRIP:=$(TARGET)-strip
RM:=rm

SED:=sed
LDSCRIPT:=zero.ld

XDEFINES:=-DHZ=100
XINCLUDES:=$(COMMS:%=-I../%) $(PROGS:%=-I../%)
DFLAGS:=-g
OFLAGS:=-O2 -fomit-frame-pointer
WFLAGS:=-ansi -Wall -Wstrict-prototypes -Wno-trigraphs
CFLAGS:=-mcpu=arm7tdmi $(DFLAGS) $(OFLAGS) $(WFLAGS) \
$(XDEFINES) $(DEFINES) $(XINCLUDES) $(INCLUDES)
ASFLAGS := $(SDDEFINES) $(XDEFINES) $(DEFINES) $(XINCLUDES) $(INCLUDES)
LDFLAGS := $(CFLAGS) -Wl,--script=$(LDSCRIPT) -nostartfiles

#objdump flags for generating listing files
ODFLAGS :=

.PHONY: all progs listings clean clean-dep clean-all

all: progs listings

progs: $(PROGS)

listings: $(PROGS:%=%.lst) $(PROGS:%=%.sym)

# deduce object files from .S and .c files

define get_objs
$(patsubst %.S,%.o, $(filter %.S,$(1))) $(patsubst %.c,%.o, $(filter %.c,$(1)))
endef

# an eval template for deducing object files and rule for a program.
# $(1) is the test program name.

define PROG_template
SRC :=
include $$($(1)_MK)
$(1)_OBJ := $$(call get_objs,$$(patsubst %, ../%,$$(PROG_SRC) $$(SRC)))

$(1): $$($(1)_OBJ)
	$$(CC) $$(LDFLAGS) -o $$@ $$^
endef

# generate the rule for building each of the programs

$(foreach t,$(PROGS),$(eval $(call PROG_template,$(t))))

# collect the object files from all of the test programs
# and use the list to include all of the dependency files.

ALL_OBJ :=
$(foreach t,$(PROGS),$(eval ALL_OBJ += $$($(t)_OBJ)))
ALL_OBJ := $(sort $(ALL_OBJ))

include $(ALL_OBJ:.o=.d)

clean:
	-$(RM) $(PROGS) $(PROGS:%=%.lst) $(PROGS:%=%.sym) \
	$(ALL_OBJ)

clean-dep:
	-$(RM) $(ALL_OBJ:.o=.d)

clean-all: clean clean-dep

%.lst: %
	$(OBJDUMP) $(ODFLAGS) -d $< > $@

%.sym: %
	$(NM) -n $< > $@

%.d: %.c
	@echo generating dependencies from $<
	@$(CC) $(CFLAGS) -MM -MG $< | \
	$(SED) 's%^\(.*\)\.o%$(dir $@)\1.d $(dir $@)\1.o%' > $@

%.d: %.S
	@echo generating dependencies from $<
	@$(CC) $(ASFLAGS) -MM -MG $< | \
	$(SED) 's%^\(.*\)\.o%$(dir $@)\1.d $(dir $@)\1.o%' > $@

(Note: make requires each command line in a rule to start with a tab character, not spaces. If copy-and-paste has turned the tabs into spaces, change them back!)

Tuesday, July 11, 2006

Newlib Angel SWI handling in C

Previously, I have shown how to implement the Angel SWI write command so that printf and friends would work on a UART. That implementation was rather simplistic because the UART writing loop holds the caller back from doing any other useful work. What I want to do now is to make use of the FIFO in the UART and to be interrupted only when the FIFO is empty enough.

But first, I want to change the SWI handling from ARM assembly code into C because the implementation has now become more sophisticated. Before you continue reading, I think you should also check the thread I started in the GNUARM mailing list on this subject, where a lot of insight can be gathered from the contributors.

It seems that GCC, through the __attribute__ modifier, supports C functions that act as handlers for the various exceptions that ARM can raise. Below is an example of a SWI handler that accepts an Angel SWI call from newlib, checks whether it is a write command, and writes the bytes in the string to a specific location.

static int *out = (int *) 0x08000000;

int __attribute__((interrupt("SWI")))
handle_swi(int reason, void *args) {
    int i, n, *a;
    char *s;

    if (reason != 5) return -1;
    a = (int*) args;
    s = (char *) a[1];
    n = a[2];
    for (i=0; i<n; ++i) *out = s[i];
    return 0;
}

Note that for Angel SWI write to work, we need to return the number of bytes yet to be written. So zero is returned above because the function has written out all of the bytes. Now, if we link this function to the ARM vector address for SWI, it looks like it may work. But look closely at the generated assembly code for the function:

00000000 <handle_swi>:
0: cmp r0, #5 ; 0x5
4: stmdb sp!, {r0, r1, r2, r3, ip}
8: mvnne r0, #0 ; 0x0
c: beq 18 <handle_swi+0x18>
10: ldmia sp!, {r0, r1, r2, r3, ip}
14: movs pc, lr
18: ldr r0, [r1, #8]
1c: cmp r0, #0 ; 0x0
20: ldr r1, [r1, #4]
24: ble 44 <handle_swi+0x44>
28: mov r2, #0 ; 0x0
2c: mov ip, #134217728 ; 0x8000000
30: ldrb r3, [r2, r1]
34: add r2, r2, #1 ; 0x1
38: cmp r0, r2
3c: str r3, [ip]
40: bne 30 <handle_swi+0x30>
44: mov r0, #0 ; 0x0
48: b 10 <handle_swi+0x10>

Upon entry, r0 is among the few registers that are saved on the stack. Upon return, at addresses 0x10 and 0x14, register r0 and the other registers are restored. As such, the return value, which is set at address 0x44, has been clobbered.

After some more reading on GCC function attributes, it appeared that the naked attribute could be used, but we need to insert inline assembly code for the prologue and epilogue ourselves. Somebody in the GNUARM mailing list thread also suggested something along these lines. That is, we would do something like this instead:
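
For reference, the parameter block that handle_swi unpacks can be exercised on a host machine. The sketch below assumes the layout used above: word 0 of the block is the file handle, word 1 the data pointer and word 2 the byte count. The names angel_write, sink_fn and capture are hypothetical, and uintptr_t replaces int so that the pointer word also fits on a 64-bit host.

```c
#include <stddef.h>
#include <stdint.h>

enum { ANGEL_SYS_WRITE = 5 };      /* reason code tested in handle_swi */

typedef void (*sink_fn)(char c);   /* hypothetical output hook */

/* Host-side stand-in for the output location at 0x08000000. */
static char captured[32];
static size_t captured_len = 0;

static void capture(char c)
{
    if (captured_len < sizeof captured - 1)
        captured[captured_len++] = c;
}

/* Returns the number of bytes NOT written (0 means all were written),
 * or -1 for a reason code this sketch does not handle. */
static int angel_write(int reason, const uintptr_t *block, sink_fn sink)
{
    const char *s;
    size_t i, n;

    if (reason != ANGEL_SYS_WRITE)
        return -1;
    s = (const char *)block[1];    /* word 1: pointer to the data */
    n = (size_t)block[2];          /* word 2: number of bytes */
    for (i = 0; i < n; ++i)
        sink(s[i]);
    return 0;
}
```

Whatever form the real handler takes, it is this return value that has to survive in r0 all the way back to newlib.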

static int *out = (int *) 0x08000000;

int __attribute__((naked))
handle_swi(int reason, void *args) {

    asm("stmdb sp!,{r4-r11,ip,lr}");
    int r, i, n, *a;
    char *s;
    r = 0;
    if (reason != 5) r = -1;
    else {
        a = (int*) args;
        s = (char *) a[1];
        n = a[2];
        for (i=0; i<n; ++i) *out = s[i];
    }
    asm("ldmia sp!,{r4-r11,ip,lr};movs pc,lr");
    return r;
}

So much for trying to avoid assembly code in the SWI handler. I chose not to save r0 to r3 because a called function is allowed to clobber them. But the rest of the registers need to be saved and restored because we don't know how the C code will use them.

At this point, I came to the conclusion that avoiding assembly code in a SWI handler is rather difficult. Actually, we need to do more than just the prologue and epilogue code. So instead of sprinkling inline assembly code in the C function, I came up with a consolidated assembly code wrapper that does the necessary checking and other preparation before calling a straight C function to handle the SWI call.

__swi_handler:
        stmdb sp!,{r4,lr}

        /* see if SWI argument is 0x123456 */
        ldr r4,[lr,#-4]
        bic r4,r4,#0xff000000
        sub r4,r4,#0x00120000
        sub r4,r4,#0x00003400
        subs r4,r4,#0x00000056

        /* save SPSR so that we have SWI reentrancy */
        mrs r4,spsr
        stmdb sp!,{r4}

        /* only call handler if SWI argument is 0x123456 */
        bleq swi_handler

        /* restore SPSR */
        ldmia sp!,{r4}
        msr spsr,r4

        ldmia sp!,{r4,pc}^

Now we just need to make sure the SWI vector calls __swi_handler, and remove the attribute modifier from the swi_handler function definition.

Actually, there is more to SWI handling than just Angel SWI, as can be gathered from the GNUARM thread mentioned earlier. For example, someone pointed out that a divide-by-zero error will cause a SWI call. We may want to handle this properly too.
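
The SWI number check done by the wrapper can also be expressed in C, which is handy for testing the logic off-target. This sketch assumes ARM-state code, where the SWI instruction is the 32-bit word at lr - 4 and its low 24 bits carry the SWI number; is_angel_swi and ANGEL_SWI_ARM are hypothetical names.

```c
#include <stdint.h>

#define ANGEL_SWI_ARM 0x123456u   /* SWI number checked by __swi_handler */

/* Mirrors the bic/sub/subs sequence: drop the condition and opcode
 * bits (the top 8), then compare the remaining 24-bit field. */
static int is_angel_swi(uint32_t swi_instruction)
{
    return (swi_instruction & 0x00FFFFFFu) == ANGEL_SWI_ARM;
}
```

On the target, the word to test would be fetched from lr - 4. Plain C has no portable way to read lr on exception entry, which is one more reason the check lives in the assembly wrapper.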