Saturday, September 02, 2006

A case of premature optimization

I had a case of premature optimization recently. I need to learn more self-control: to take it easy and let the compiler does what it does best.

I had a few tests I needed to write and they had to be super tight because they would run in gate level simulation (GLS) where time is a luxury the verification team do not have.

My other more normal test programs make use of the newlib library and I have implemented system clock and console to make writing test programs more palatable. However adding newlib makes test programs big and slow to complete in a simulation environment. So, for the GLS, I decided to make the test programs as lean as possible. This means not using newlib, or any of the libraries for that matter. Which also means no standard C library and the math emulation (because ARM7 do not have math coprocessor and does not have divide instruction.)

As an alternative to using printf and friends, I wrote a function to write to a GPIO port instead:

static void lean_vout(char *s, ...) {
va_list ap;
va_start(ap, s);
lean_out(s);
while ( (s=va_arg(ap,char*)) != 0 )
lean_out(s);
va_end(ap);
}

Where lean_out prints each character in the string to the GPIO port. Printing to GPIO is fast and the verification team can easily see the characters in the simulation signal viewer.

Beside, the ability to see what is happening in a test program, the ability to do assertions is important too. I cannot use the facility in the <assert.h> because it makes use of printf. So, I have this macro for my lean-and-mean assertion:

#define lean_assert(cond) if (!(cond)) lean_do_assert(__FILE__, __LINE__);

Basically, when the assertion fails, the filename and the line number would scuttle through the GPIO port. But there is a small problem: I need to convert the line number into its ASCII representation. Because, recently, I had a program failing to compile because I use division with -nostdlib, I thought I needed to use an algorithm to divide by 10 without actually using the division operator. And find I did: Section 10 to the "Hacker's Delight" book which talks about division by constant in great depth. Based on the information in this section, I came up with my version of itoa for converting integer to base-10 ASCII representation:

/* taken straight from "Hacker's Delight" */

static int remu10(unsigned n) {
static char table[16] =
{0, 1, 2, 2, 3, 3, 4, 5,
5, 6, 7, 7, 8, 8, 9, 0};
n = (0x19999999*n + (n >> 1) + (n >> 3)) >> 28;
return table[n];
}

char *lean_itoa(int n, char *s, int len) {
int r;
s += len;
*--s = '\0';
while (n && len > 0) {
r = remu10(n);
n = ((n - r) >> 1)*0xCCCCCCCD; /* also from "Hacker's Delight" */
*--s = '0' + r;
--len;
}
return s;
}

Basically, the remainder is computed and the divide-by-10 is done using the remainder. The generated code from arm-elf-gcc (with optimization -O2):

0000820c <lean_itoa>:
820c: e0821001 add r1, r2, r1
8210: e3a03000 mov r3, #0 ; 0x0
8214: e3500000 cmp r0, #0 ; 0x0
8218: 13520000 cmpne r2, #0 ; 0x0
821c: e92d4010 stmdb sp!, {r4, lr}
8220: e1a0c002 mov ip, r2
8224: e5413001 strb r3, [r1, #-1]
8228: e2411001 sub r1, r1, #1 ; 0x1
822c: da000018 ble 8294 <lean_itoa+0x88>
8230: e59f4064 ldr r4, [pc, #100] ; 829c <.text+0x27c>
8234: e1a0e001 mov lr, r1
8238: e0803080 add r3, r0, r0, lsl #1
823c: e0803103 add r3, r0, r3, lsl #2
8240: e0633303 rsb r3, r3, r3, lsl #6
8244: e0803103 add r3, r0, r3, lsl #2
8248: e0633703 rsb r3, r3, r3, lsl #14
824c: e0803183 add r3, r0, r3, lsl #3
8250: e08330a0 add r3, r3, r0, lsr #1
8254: e08331a0 add r3, r3, r0, lsr #3
8258: e7d41e23 ldrb r1, [r4, r3, lsr #28]
825c: e0612000 rsb r2, r1, r0
8260: e1a020c2 mov r2, r2, asr #1
8264: e0823082 add r3, r2, r2, lsl #1
8268: e0833203 add r3, r3, r3, lsl #4
826c: e0833403 add r3, r3, r3, lsl #8
8270: e0833803 add r3, r3, r3, lsl #16
8274: e0820103 add r0, r2, r3, lsl #2
8278: e24cc001 sub ip, ip, #1 ; 0x1
827c: e2811030 add r1, r1, #48 ; 0x30
8280: e3500000 cmp r0, #0 ; 0x0
8284: 135c0000 cmpne ip, #0 ; 0x0
8288: e56e1001 strb r1, [lr, #-1]!
828c: caffffe9 bgt 8238 <lean_itoa+0x2c>
8290: e1a0100e mov r1, lr
8294: e1a00001 mov r0, r1
8298: e8bd8010 ldmia sp!, {r4, pc}
829c: 0001085b andeq r0, r1, fp, asr r8

From the look of it, the compiler has tried very hard to use additions and shifts to perform the constant multiplications. This led me into thinking that, maybe, the compiler would be clever enough to avoid doing division in division by constants (of course, it is but it didn't occur to my thick head.) I tried this instead:

char *lean_itoa(int n, char *s, int len) {
int r;
s += len;
*--s = '\0';
while (n && len > 0) {
r = n % 10;
n = n / 10;
*--s = '0' + r;
--len;
}
return s;
}

And the generated assembly code:

0000820c <lean_itoa>:
820c: e0821001 add r1, r2, r1
8210: e3a03000 mov r3, #0 ; 0x0
8214: e3500000 cmp r0, #0 ; 0x0
8218: 13520000 cmpne r2, #0 ; 0x0
821c: e92d4010 stmdb sp!, {r4, lr}
8220: e5413001 strb r3, [r1, #-1]
8224: e1a0e002 mov lr, r2
8228: e2411001 sub r1, r1, #1 ; 0x1
822c: da00000f ble 8270 <lean_itoa+0x64>
8230: e59f4040 ldr r4, [pc, #64] ; 8278 <.text+0x258>
8234: e1a0c001 mov ip, r1
8238: e0c13094 smull r3, r1, r4, r0
823c: e1a03fc0 mov r3, r0, asr #31
8240: e0633141 rsb r3, r3, r1, asr #2
8244: e1a02003 mov r2, r3
8248: e0833103 add r3, r3, r3, lsl #2
824c: e0403083 sub r3, r0, r3, lsl #1
8250: e24ee001 sub lr, lr, #1 ; 0x1
8254: e2833030 add r3, r3, #48 ; 0x30
8258: e3520000 cmp r2, #0 ; 0x0
825c: 135e0000 cmpne lr, #0 ; 0x0
8260: e1a00002 mov r0, r2
8264: e56c3001 strb r3, [ip, #-1]!
8268: cafffff2 bgt 8238 <lean_itoa+0x2c>
826c: e1a0100c mov r1, ip
8270: e1a00001 mov r0, r1
8274: e8bd8010 ldmia sp!, {r4, pc}
8278: 66666667 strvsbt r6, [r6], -r7, ror #12

So, yes! The compiler avoids calling the division function. The previous generated code has 36 instructions with one load instruction for the table. The second one has 27 instructions with one load instruction used in the "signed multiply long" instruction. Which one is faster? I presume the compiler concluded that using the long multiplication is faster than additions-and-shifts (See also Re: Optimization question). But surely the second code is more compact.

This led me into digging further on how long divide-by-constant optimization has been around in GCC. Apparently, since 1994 when Torbjorn Granlund and Peter L. Montgomery wrote about it in their paper, "Division by Invariant Integers using Multiplication."

So there you have it. A case of trying to hard. I should've let GCC do its job. But I have learned a lot from the whole process.