That time I implemented objc_msgSend in PowerPC assembler

03 September 2019 by Lincoln Ramsay

A little while back, I worked for a company that was using Objective-C on Linux. They were upgrading their toolchain to take advantage of newer language features, like C++11, Objective-C++ and Objective-C 2.0.

One of the key things in the Objective-C language is that objects (classes) can be mutated at runtime. So you can’t just compile in function calls the way you’d do in C or C++. Instead, you have to look up the actual function (IMP) that the selected method (SEL) is connected to and then call it. Since this has to be done for every method call, you really want this process to be as quick as possible. The Objective-C runtime defines a method, objc_msgSend that is responsible for this. You can call it explicitly, but normally the compiler calls it for you. Like so.

// ret = [object method:arg];
ret = objc_msgSend(object, @selector(method:), arg);

Since this needs to be as fast as possible, it is implemented in assembler, as a trampoline (at least in the fast case). I found that libobjc2 had implementations of this method for several architectures, but not for PowerPC.

Since we needed this to work on a PowerPC system, I implemented objc_msgSend in PowerPC assembler, plus a few other bits to make everything work. The patch I wrote is 007-libobjc2-ppc.patch. I am quite proud of that code. I think it manages to be very readable, even if you’re not an expert in PowerPC or assembler. Nevertheless, I’m going to break it down a bit and explain things some more.

First of all, I should explain a bit about the PowerPC architecture.

PowerPC (at least the MPC5121e we were targeting) is a 32-bit architecture. It’s got 32 general purpose registers, a bunch of floating point registers, plus some specialty registers. If you want more detail, you might start here.

The PowerPC System V ABI defines standards like how to call a function, setup a stack frame and so on. It reserves some of the general purpose registers. For example, r1 is the stack frame pointer. It also defines which registers can be modified by a function and which one should be preserved. Because objc_msgSend is implemented in assembler, but is called by, and (sometimes) calls into C code, it’s important that it does not violate the ABI.

By convention, PowerPC registers are referred to by number. That can get confusing because it’s not always immediately obvious if a number is a constant or a register. It’s even worse for register 0 because for some instructions, 0 means a number but 1-31 means a register! I wanted my code to be more readable so I setup some aliases. These are the registers that my code touches.

// Access registers by name
#define  r0  0
#define  sp  1 // r1 is the stack/frame pointer
#define  r3  3
#define  r4  4
#define  r5  5
#define  r6  6
#define  r7  7
#define  r8  8
#define  r9  9
#define r10 10
#define r11 11
#define r12 12
#define  f1  1
#define  f2  2
#define  f3  3
#define  f4  4
#define  f5  5
#define  f6  6
#define  f7  7
#define  f8  8
#define  f9  9
#define f10 10
#define f11 11
#define f12 12
#define f13 13

Since the implementation of objc_msgSend is going to read things out of objects, we need to know some offsets.

// objc_class layout
#define ISA                 0
#define DTABLE              32

// objc_selector layout
#define INDEX               0

// objc_slot layout
#define METHOD              16

// SparseArray layout
#define SHIFT               4
#define DATA                12

And there’s one final constant, which is related to an optimization that I’ll discuss later.

// Used to identify small objects
#define SMALLOBJ_MASK       31

It turns out there’s actually 3 objc_msgSend functions, differentiated by return type. On PowerPC, the regular and fpret implementations aren’t any different, so they can share code. Here, I create symbols for both function names at the same address so that they literally share the same implementation. I didn’t even know this was a thing, but it’s what the ARM implementation does.

 * id        objc_msgSend(id receiver, SEL selector, ...);
 * double    objc_msgSend_fpret(id receiver, SEL selector, ...);
 * On entry: r3 is the message receiver
 *           r4 is the selector
 *           r5-r10 are the arguments
.globl CDECL(objc_msgSend)
.globl CDECL(objc_msgSend_fpret)
TYPE_DIRECTIVE(CDECL(objc_msgSend), @function)
TYPE_DIRECTIVE(CDECL(objc_msgSend_fpret), @function)
    DoMsgSend r3, r4

The third version returns a structure. Since that’s too big to fit into a register, this variant causes all the registers to be moved over by one.

 * struct_type  objc_msgSend_stret(id receiver, SEL selector, ...);
 * On entry: r3 is the address of the structure being returned
 *           r4 is the message receiver
 *           r5 is the selector
 *           r6-r10 are the arguments
.globl CDECL(objc_msgSend_stret)
TYPE_DIRECTIVE(CDECL(objc_msgSend_stret), @function)
    DoMsgSend r4, r5

Note how the above used DoMsgSend? I was able to implement objc_msgSend as a macro since it was virtually identical. There ends up being 2 copies of the code in the binary, but I only had to write 1 implementation.

DoMsgSend does a lookup on the IMP cache. You can see a C version of the dtable lookup logic in sarray2.h (the SparseArrayLookup function).

My goal was to implement objc_msgSend using only registers I was allowed to modify, in order to avoid the need to save and restore registers. The ABI says I can use r0, r11 and r12, so I restricted myself to using only these registers (on the fast path).

 * DoMsgSend RR, SR
 * Locate the IMP for a receiver and selector, then branch to it.
 * On entry: RR is r3 or r4 (receiver)
 *           SR is r4 or r5 (selector)
.macro DoMsgSend RR, SR
    cmplwi  \RR, 0                  // if (receiver == nil)
    beqlr-                          // if so, return (unlikely)

    GetClass \RR                    // r12 = class of receiver
    lwz     r12, DTABLE(r12)        // r12 = class->dtable
    lwz     r11,  SHIFT(r12)        // r11 = dtable->shift
    lwz     r12,   DATA(r12)        // r12 = dtable->data

    // When shifting, we do >>shift but then we have to multiply
    // by sizeof(void*) to get a byte offset, which is the same
    // as <<2 so we just >>(shift-2), or <<2 for dtable8

    cmplwi  r11, 8                  // if (shift == 8)
    beq     11f                     // goto dtable16
    cmplwi  r11, 0                  // if (shift == 0)
    beq     12f                     // goto dtable8

    // ASSUME shift == 16
    // The C implementation also supports shift == 24 but the other
    // ASM implementations don't handle this.

    lis     r0, 0xff                // r0 = 0xff0000
    lwz     r11, INDEX(\SR)         // r11 = selector->index
    and     r11, r11, r0            // r11 = r11 & r0
    srwi    r11, r11, 14            // r11 = r11 >> (shift-2)
    lwzx    r12, r12, r11           // dtable = data[r11]
    lwz     r12, DATA(r12)          // r12 = dtable->data

11:                                 // dtable16
    li      r0, 0                   // r0 = 0 (cannot li 0xff00)
    ori     r0, r0, 0xff00          // but we can or 0xff00
    lwz     r11, INDEX(\SR)         // r11 = selector->index
    and     r11, r11, r0            // r11 = r11 & r0
    srwi    r11, r11, 6             // r11 = r11 >> (shift-2)
    lwzx    r12, r12, r11           // dtable = data[r11]
    lwz     r12, DATA(r12)          // r12 = dtable->data

12:                                 // dtable8
    li      r0, 0xff                // r0 = 0xff
    lwz     r11, INDEX(\SR)         // r11 = selector->index
    and     r11, r11, r0            // r11 = r11 & r0
    slwi    r11, r11, 2             // r11 = r11 << 2 (>> (shift-2))
    lwzx    r12, r12, r11           // slot = data[r11]

    cmplwi  r12, 0                  // if (slot == 0)
    beq-    13f                     // goto slow_path (unlikely)
    lwz     r12, METHOD(r12)        // r12 = slot->method
    mtctr   r12
    bctr                            // goto *imp

13:                                 // slow_path
    SlowMethodLookup \RR

Even if you can’t understand the opcodes, those comments should make it clear what’s going on here.

Note that in order for this to function as a trampoline, the link register needs to be preserved. Because of the way PowerPC instructions work, it’s simply not possible to jump to arbitrary memory locations without first loading them into one of 2 registers. That’s why the code uses the count register.

Since you can mutate a class and not an instance, the IMP cache lives on the Class. The receiver (an object) has a pointer to its Class. There’s an optimization we have to watch out for where some classes are stored in a global table.

 * GetClass RR
 * Get the Class pointer for a receiver. Handles small objects.
 * On entry: RR is r3 or r4 (receiver)
 * On exit:  r12 = class of receiver
.macro GetClass RR
    clrlwi  r12, \RR, SMALLOBJ_MASK // Clear all but the LSB (RR & SMALLOBJ_MASK)
    cmplwi  r12, 0                  // if (lsb != 0)
    bne-    1f                      // goto smallobject (unlikely)

    lwz     r12, ISA(\RR)           // r12 = receiver->isa
    b       2f                      // goto end

1:                                  // smallobject
    mflr    r0                      // save lr
    bl      GetGOT                  // fetch GOT into r12
    mtlr    r0                      // restore lr

    lwz     r12, SmallObjectClasses@got(r12)
    lwz     r12, 0(r12)             // r12 = SmallObjectClasses[0]

2:                                  // end

I mentioned above that the link register is used for returning, and that it needs to be preserved. You can see here that I save the link register into r0 before “calling” GetGOT and restore it afterwards.

GetGOT is a little bit strange… The address of the GOT needs to be computed, and for that we need to make a method call. So I have a method that can be called, which computes the address of the GOT. If you want to read more about why this particular hoop needs to be jumped through, see Getting the absolute address of the GLOBAL_OFFSET_TABLE symbol on powerpc32.

The main bit of magic is getting the program counter. While PowerPC keeps track of the instruction it’s running, it doesn’t just let you fetch this. The code works around this by doing a short relative branch, saving LR (it ends up being the same value that PC is).

 * GetGOT
 * Get the GLOBAL_OFFSET_TABLE address.
 * On exit:  r12 = GOT
    mflr    r11                     // save lr
    bcl     20, 31, GetGOTLabel     // get PC
    mflr    r12                     // into r12
    // now add the fixed (at runtime) offset to PC
    addis   r12, r12, _GLOBAL_OFFSET_TABLE_ - GetGOTLabel@h
    addi    r12, r12, _GLOBAL_OFFSET_TABLE_ - GetGOTLabel@l
    mtlr    r11                     // restore lr
    blr                             // return

Now we get to the most annoying part. Calling the slow path. If the IMP wasn’t in the cache, we have to call a C method. This is where following that ABI really matters. This is where we finally see a difference in behaviour for that third objc_msgSend call. We need to move arguments because we’re calling a function that doesn’t return a struct.

I recall having some issues getting the stack frame offsets correct for the registers. I ended up defining all the offsets as constants, which helped to make the code easy to read as a bonus.

 * SlowMethodLookup RR
 * On entry: RR is r3 or r4 (receiver)
.macro SlowMethodLookup RR
    // objc_msgSend functions like a trampoline. It's called like a C function
    // but it invokes the destination function without following calling conventions
    // so that the destination function returns to the parent directly.
    // However, we are about to invoke a C function so we have to setup a stack
    // frame and because we don't know how many arguments were passed, we
    // have to save all the argument registers.

.cfi_startproc                      // generate unwind data

    // Save lr to the lr save word in the previous stack frame
    mflr    r0
    stw     r0,   4(sp)

// The first 2 words of the stack frame are for SP and LP
// Int registers take 4 bytes each
// Float registers take 8 bytes each
// STACK_SIZE is the last slot + size of that slot, rounded up to a multiple of 16
#define SLOT_1     8
#define SLOT_2     12
#define SLOT_3     16
#define SLOT_4     20
#define SLOT_5     24
#define SLOT_6     28
#define SLOT_7     32
#define SLOT_8     36
#define SLOT_F1    40
#define SLOT_F2    48
#define SLOT_F3    56
#define SLOT_F4    64
#define SLOT_F5    72
#define SLOT_F6    80
#define SLOT_F7    88
#define SLOT_F8    96
#define SLOT_F9    104
#define SLOT_F10   112
#define SLOT_F11   120
#define SLOT_F12   128
#define SLOT_F13   136
#define STACK_SIZE 144

    stwu    sp, -STACK_SIZE(sp)     // setup the stack frame

    stw     r3,  SLOT_1(sp)         // save function arguments onto the stack
    stw     r4,  SLOT_2(sp)
    stw     r5,  SLOT_3(sp)
    stw     r6,  SLOT_4(sp)
    stw     r7,  SLOT_5(sp)
    stw     r8,  SLOT_6(sp)
    stw     r9,  SLOT_7(sp)
    stw     r10, SLOT_8(sp)
    stfd    f1,  SLOT_F1(sp)
    stfd    f2,  SLOT_F2(sp)
    stfd    f3,  SLOT_F3(sp)
    stfd    f4,  SLOT_F4(sp)
    stfd    f5,  SLOT_F5(sp)
    stfd    f6,  SLOT_F6(sp)
    stfd    f7,  SLOT_F7(sp)
    stfd    f8,  SLOT_F8(sp)
    stfd    f9,  SLOT_F9(sp)
    stfd    f10, SLOT_F10(sp)
    stfd    f11, SLOT_F11(sp)
    stfd    f12, SLOT_F12(sp)
    stfd    f13, SLOT_F13(sp)

    // We are going to invoke IMP slowMsgLookup(id *receiver, SEL cmd)
    // Pass the address of the receiver arg (on the stack)
.if \RR == r3
    addi    r3, sp, SLOT_1
    addi    r3, sp, SLOT_2          // receiver was in r4
    mr      r4, r5                  // copy the selector to r4

    // If we objc_msgSend an unknown object, we'll calls things like
    // +initialize and +load from within this stack frame and if they
    // throw an exception, libunwind needs to know how to unwind this
    // frame.
.cfi_adjust_cfa_offset STACK_SIZE   // size of the frame
.cfi_offset lr,   4                 // how to restore lr

    bl      CDECL(slowMsgLookup)    // returns the IMP in r3
    mtctr   r3                      // copy IMP to ctr

    lwz     r3,  SLOT_1(sp)         // restore function arguments from the stack
    lwz     r4,  SLOT_2(sp)
    lwz     r5,  SLOT_3(sp)
    lwz     r6,  SLOT_4(sp)
    lwz     r7,  SLOT_5(sp)
    lwz     r8,  SLOT_6(sp)
    lwz     r9,  SLOT_7(sp)
    lwz     r10, SLOT_8(sp)
    lfd     f1,  SLOT_F1(sp)
    lfd     f2,  SLOT_F2(sp)
    lfd     f3,  SLOT_F3(sp)
    lfd     f4,  SLOT_F4(sp)
    lfd     f5,  SLOT_F5(sp)
    lfd     f6,  SLOT_F6(sp)
    lfd     f7,  SLOT_F7(sp)
    lfd     f8,  SLOT_F8(sp)
    lfd     f9,  SLOT_F9(sp)
    lfd     f10, SLOT_F10(sp)
    lfd     f11, SLOT_F11(sp)
    lfd     f12, SLOT_F12(sp)
    lfd     f13, SLOT_F13(sp)

    lwz     sp, 0(sp)               // pop the stack frame
    lwz     r0, 4(sp)               // restore lr
    mtlr    r0
.cfi_endproc                        // generate unwind data

    bctr                            // goto *imp

As with the fast path, we finish by loading the IMP into the count register so that we can call it like a trampoline.

One of the features of Objective-C 2.0 is support for blocks (lambdas for C). There’s a helper to turn a block into an IMP (since so many existing APIs work with IMPs rather than blocks). Note the 2 variants again (regular vs structure return).

// block_to_imp sets up the memory like this: [ block_invoker ] [ blockptr ] [ trampoline ]
// Called as trampoline(receiver, selector, ...)
// Change this to block_invoker(blockptr, receiver, ...)
// RR = r3 or r4
// SR = r4 or r5
.macro trampoline RR, SR
   mflr    r0                  // save lr
   bcl     20, 31, $+4         // get PC
   mflr    r11                 // into r11
   mtlr    r0                  // restore lr

   mr      \SR, \RR            // move receiver to SR
   lwz     \RR, -12(r11)       // RR = block pointer (3 words before r11)
   lwz     r0, -16(r11)        // load block_invoker (4 words before r11)
   mtctr   r0                  // into the counter register
   bctr                        // goto *block_invoker

   trampoline r3, r4
   trampoline r4, r5           // r3 holds the structure address

Hmm… This has an alternate implementation of that “get PC” logic that uses a relative branch instead of a label. Same numer of instructions though.

In theory, your system should have a working __clear_cache, but that wasn’t the case for me. I had to implement it so that block_to_imp could actually work.

// block_to_imp.c sets up the trampoline in heap-allocated memory but for this
// to work, __clear_cache needs to instruct the CPU to clear the memory from
// the instruction cache. But libgcc's __clear_cache doesn't do anything!
// This is an implementation of __clear_cache that actually works.
// r3 = start of trampoline
// r4 = end of trampoline
.globl CDECL(__clear_cache)
TYPE_DIRECTIVE(CDECL(__clear_cache), @function)
1:                             // LOOP:
   dcbst   r0, r3              // flush data cache line
   icbi    r0, r3              // flush instruction cache line
   addi    r3, r3, 8           // r3 += CACHE_LINE_SIZE
   cmpld   r3, r4              // if (r3 <= r4)
   ble     1b                  // goto LOOP
   sync                        // ensure dcbst completes
   isync                       // ensure icbi completes
   blr                         // return

The final piece in the puzzle was to get Clang to actually call objc_msgSend for PowerPC. By default it was inserting code to call the slow path all the time (ew). Adding the flag -fno-objc-legacy-dispatch got it to call objc_msgSend.

With all of that done, I was able to get the libobjc2 tests running (at least as well as they did on other architectures, not all of the tests were passing for us). Also, some actual code that used objc_msgSend and blocks was able to run.