Monday, April 16, 2007

Debugging: Solaris bus error caused by taking pointer to structure member

Take a look at this sample program that fails horribly when compiled on Solaris using gcc (I haven't tried other compilers, and I'm not pointing my finger at gcc here, this is a Sun gotcha).

The example (simplified from something much more complex that I was debugging) illustrates how memory alignment on SPARC systems can bite you if you are doing low-level things in C. The program allocates space for a thing structure prefixed by a header. The header structure ends with a dummy byte array called data, which is used to reference the start of the thing.
#include <stdlib.h>

struct thing {
  int an_int;
};

struct header {
  short id;
  char data[0];  /* marks where the thing starts */
};

struct header * maker( int size ) {
  return (struct header *)malloc( sizeof( struct header ) + size );
}

int main( void ) {
  struct header * a_headered_thing = maker( sizeof( struct thing ) );

  struct thing * a_thing = (struct thing *)&(a_headered_thing->data[0]);

  a_thing->an_int  = 42;
}

If you build this on a SPARC machine you'll get the following error when you run it:
Bus Error (core dumped)

Annoyingly, if you build a debugging version of this program the problem magically goes away and it doesn't dump core in the debugger. So you either resort to printf-style debugging or go into gdb and look at the assembly output.

Here's what happens when you run this in gdb (non-debug code):
(gdb) run

Program received signal SIGSEGV, Segmentation fault.
0x000106d8 in main ()

Since you can't get back to the source we're forced to do a little disassembly:
(gdb) disassemble
Dump of assembler code for function main:
0x000106b0: save %sp, -120, %sp
0x000106b4: mov 4, %o0
0x000106b8: call 0x10688
0x000106bc: nop
0x000106c0: st %o0, [ %fp + -20 ]
0x000106c4: ld [ %fp + -20 ], %o0
0x000106c8: add %o0, 2, %o0
0x000106cc: st %o0, [ %fp + -24 ]
0x000106d0: ld [ %fp + -24 ], %o1
0x000106d4: mov 0x2a, %o0
0x000106d8: st %o0, [ %o1 ]    <-- offending instruction
0x000106dc: mov %o0, %i0
0x000106e0: nop
0x000106e4: ret
0x000106e8: restore
0x000106ec: retl
0x000106f0: add %o7, %l7, %l7
End of assembler dump.

The offending instruction is the st at 0x000106d8, the address gdb reported when the program stopped. From the code you can clearly see that the o0 register contains the value 0x2a (which is, of course, 42) and hence we are looking at code corresponding to the line a_thing->an_int = 42;. The st instruction is going to write the 42 into the an_int field of thing. The address of an_int is stored in o1.

Asking gdb for o1's value shows us:
(gdb) info registers o1
o1             0x2094a  133450

An int is 4 bytes and you can easily see that the address of an_int stored in o1 is not 4-byte aligned (133450 mod 4 = 2, or just stare at the bottom nybble). The SPARC architecture insists that data accesses be correctly aligned for the size of the access. In this case we need 4-byte alignment (note that malloc will make sure that the block it returns is suitably aligned, and the compiler will lay out structure members at their correct alignment while minimizing space).

In this case, the code fails because the data member is only byte aligned (since we declared it as a character array), but then we take a pointer to it and treat it as a structure with an integer member. Oops. Bus error.

(Note you could have discovered this with printf and %p to get the pointer values without going into the debugger and poking around in the assembly code).

There are a couple of ways to fix it. The first is to pad the header structure so that data is correctly aligned: adding 2 bytes of padding in the form of a short will make the problem go away:
struct header {
  short id;
  short padding;
  char data[0];
};

That's ugly and requires careful commenting and could be a maintenance problem if maker is used to make things requiring a different alignment, or the header structure is modified.

It's slightly cleaner to not have padding but change the type of data to something like the alignment you want:
struct header {
  short id;
  int data[0];
};

(Or even double data[0] to get 8 byte alignment). With gcc you could even make this really clear by using the aligned attribute to create a special type:
typedef char aligned_data __attribute__ ((aligned (8)));

struct header {
  short id;
  aligned_data data[0];
};

I think that's the clearest option of all. With a little documentation around this it should be maintainable.

1 comment:

Bart said...

It's been a number of years since I did low-level system programming in C, but my recollection is that the maximally-portable solution to your problem is to use a union for the data field of the struct.

Of course, if you don't care about compiling for something older than c99, the aligned attribute is fine.