That time I implemented objc_msgSend in PowerPC assembler
A little while back, I worked for a company that was using Objective-C on Linux. They were upgrading their toolchain to take advantage of newer language features, like C++11, Objective-C++ and Objective-C 2.0.
One of the key things in the Objective-C language is that objects (classes) can be mutated at runtime. So you can’t just compile in function calls the way you’d do in C or C++. Instead, you have to look up the actual function (IMP) that the selected method (SEL) is connected to and then call it. Since this has to be done for every method call, you really want this process to be as quick as possible. The Objective-C runtime defines a method, objc_msgSend that is responsible for this. You can call it explicitly, but normally the compiler calls it for you. Like so.
// ret = [object method:arg];
ret = objc_msgSend(object, @selector(method:), arg);
Since this needs to be as fast as possible, it is implemented in assembler, as a trampoline (at least in the fast case). I found that libobjc2 had implementations of this method for several architectures, but not for PowerPC.
Since we needed this to work on a PowerPC system, I implemented objc_msgSend in PowerPC assembler, plus a few other bits to make everything work. The patch I wrote is 007-libobjc2-ppc.patch. I am quite proud of that code. I think it manages to be very readable, even if you’re not an expert in PowerPC or assembler. Nevertheless, I’m going to break it down a bit and explain things some more.
First of all, I should explain a bit about the PowerPC architecture.
PowerPC (at least the MPC5121e we were targeting) is a 32-bit architecture. It’s got 32 general purpose registers, a bunch of floating point registers, plus some specialty registers. If you want more detail, you might start here.
The PowerPC System V ABI defines standards like how to call a function, setup a stack frame and so on. It reserves some of the general purpose registers. For example, r1 is the stack frame pointer. It also defines which registers can be modified by a function and which one should be preserved. Because objc_msgSend is implemented in assembler, but is called by, and (sometimes) calls into C code, it’s important that it does not violate the ABI.
By convention, PowerPC registers are referred to by number. That can get confusing because it’s not always immediately obvious if a number is a constant or a register. It’s even worse for register 0 because for some instructions, 0 means a number but 1-31 means a register! I wanted my code to be more readable so I setup some aliases. These are the registers that my code touches.
// Access registers by name
#define r0 0
#define sp 1 // r1 is the stack/frame pointer
#define r3 3
#define r4 4
#define r5 5
#define r6 6
#define r7 7
#define r8 8
#define r9 9
#define r10 10
#define r11 11
#define r12 12
#define f1 1
#define f2 2
#define f3 3
#define f4 4
#define f5 5
#define f6 6
#define f7 7
#define f8 8
#define f9 9
#define f10 10
#define f11 11
#define f12 12
#define f13 13
Since the implementation of objc_msgSend is going to read things out of objects, we need to know some offsets.
// objc_class layout
#define ISA 0
#define DTABLE 32
// objc_selector layout
#define INDEX 0
// objc_slot layout
#define METHOD 16
// SparseArray layout
#define SHIFT 4
#define DATA 12
And there’s one final constant, which is related to an optimization that I’ll discuss later.
// Used to identify small objects
#define SMALLOBJ_MASK 31
It turns out there’s actually 3 objc_msgSend functions, differentiated by return type. On PowerPC, the regular and fpret implementations aren’t any different, so they can share code. Here, I create symbols for both function names at the same address so that they literally share the same implementation. I didn’t even know this was a thing, but it’s what the ARM implementation does.
/********************************************************************
* id objc_msgSend(id receiver, SEL selector, ...);
* double objc_msgSend_fpret(id receiver, SEL selector, ...);
*
* On entry: r3 is the message receiver
* r4 is the selector
* r5-r10 are the arguments
********************************************************************/
.globl CDECL(objc_msgSend)
.globl CDECL(objc_msgSend_fpret)
TYPE_DIRECTIVE(CDECL(objc_msgSend), @function)
TYPE_DIRECTIVE(CDECL(objc_msgSend_fpret), @function)
CDECL(objc_msgSend):
CDECL(objc_msgSend_fpret):
DoMsgSend r3, r4
The third version returns a structure. Since that’s too big to fit into a register, this variant causes all the registers to be moved over by one.
/********************************************************************
* struct_type objc_msgSend_stret(id receiver, SEL selector, ...);
*
* On entry: r3 is the address of the structure being returned
* r4 is the message receiver
* r5 is the selector
* r6-r10 are the arguments
********************************************************************/
.globl CDECL(objc_msgSend_stret)
TYPE_DIRECTIVE(CDECL(objc_msgSend_stret), @function)
CDECL(objc_msgSend_stret):
DoMsgSend r4, r5
Note how the above used DoMsgSend? I was able to implement objc_msgSend as a macro since it was virtually identical. There ends up being 2 copies of the code in the binary, but I only had to write 1 implementation.
DoMsgSend does a lookup on the IMP cache. You can see a C version of the dtable lookup logic in sarray2.h (the SparseArrayLookup function).
My goal was to implement objc_msgSend using only registers I was allowed to modify, in order to avoid the need to save and restore registers. The ABI says I can use r0, r11 and r12, so I restricted myself to using only these registers (on the fast path).
/********************************************************************
* DoMsgSend RR, SR
*
* Locate the IMP for a receiver and selector, then branch to it.
*
* On entry: RR is r3 or r4 (receiver)
* SR is r4 or r5 (selector)
********************************************************************/
.macro DoMsgSend RR, SR
cmplwi \RR, 0 // if (receiver == nil)
beqlr- // if so, return (unlikely)
GetClass \RR // r12 = class of receiver
lwz r12, DTABLE(r12) // r12 = class->dtable
lwz r11, SHIFT(r12) // r11 = dtable->shift
lwz r12, DATA(r12) // r12 = dtable->data
// When shifting, we do >>shift but then we have to multiply
// by sizeof(void*) to get a byte offset, which is the same
// as <<2 so we just >>(shift-2), or <<2 for dtable8
cmplwi r11, 8 // if (shift == 8)
beq 11f // goto dtable16
cmplwi r11, 0 // if (shift == 0)
beq 12f // goto dtable8
// ASSUME shift == 16
// The C implementation also supports shift == 24 but the other
// ASM implementations don't handle this.
lis r0, 0xff // r0 = 0xff0000
lwz r11, INDEX(\SR) // r11 = selector->index
and r11, r11, r0 // r11 = r11 & r0
srwi r11, r11, 14 // r11 = r11 >> (shift-2)
lwzx r12, r12, r11 // dtable = data[r11]
lwz r12, DATA(r12) // r12 = dtable->data
11: // dtable16
li r0, 0 // r0 = 0 (cannot li 0xff00)
ori r0, r0, 0xff00 // but we can or 0xff00
lwz r11, INDEX(\SR) // r11 = selector->index
and r11, r11, r0 // r11 = r11 & r0
srwi r11, r11, 6 // r11 = r11 >> (shift-2)
lwzx r12, r12, r11 // dtable = data[r11]
lwz r12, DATA(r12) // r12 = dtable->data
12: // dtable8
li r0, 0xff // r0 = 0xff
lwz r11, INDEX(\SR) // r11 = selector->index
and r11, r11, r0 // r11 = r11 & r0
slwi r11, r11, 2 // r11 = r11 << 2 (>> (shift-2))
lwzx r12, r12, r11 // slot = data[r11]
cmplwi r12, 0 // if (slot == 0)
beq- 13f // goto slow_path (unlikely)
lwz r12, METHOD(r12) // r12 = slot->method
mtctr r12
bctr // goto *imp
13: // slow_path
SlowMethodLookup \RR
.endm
Even if you can’t understand the opcodes, those comments should make it clear what’s going on here.
Note that in order for this to function as a trampoline, the link register needs to be preserved. Because of the way PowerPC instructions work, it’s simply not possible to jump to arbitrary memory locations without first loading them into one of 2 registers. That’s why the code uses the count register.
Since you can mutate a class and not an instance, the IMP cache lives on the Class. The receiver (an object) has a pointer to its Class. There’s an optimization we have to watch out for where some classes are stored in a global table.
/********************************************************************
* GetClass RR
*
* Get the Class pointer for a receiver. Handles small objects.
*
* On entry: RR is r3 or r4 (receiver)
* On exit: r12 = class of receiver
********************************************************************/
.macro GetClass RR
clrlwi r12, \RR, SMALLOBJ_MASK // Clear all but the LSB (RR & SMALLOBJ_MASK)
cmplwi r12, 0 // if (lsb != 0)
bne- 1f // goto smallobject (unlikely)
lwz r12, ISA(\RR) // r12 = receiver->isa
b 2f // goto end
1: // smallobject
mflr r0 // save lr
bl GetGOT // fetch GOT into r12
mtlr r0 // restore lr
lwz r12, SmallObjectClasses@got(r12)
lwz r12, 0(r12) // r12 = SmallObjectClasses[0]
2: // end
.endm
I mentioned above that the link register is used for returning, and that it needs to be preserved. You can see here that I save the link register into r0 before “calling” GetGOT and restore it afterwards.
GetGOT is a little bit strange… The address of the GOT needs to be computed, and for that we need to make a method call. So I have a method that can be called, which computes the address of the GOT. If you want to read more about why this particular hoop needs to be jumped through, see Getting the absolute address of the GLOBAL_OFFSET_TABLE symbol on powerpc32.
The main bit of magic is getting the program counter. While PowerPC keeps track of the instruction it’s running, it doesn’t just let you fetch this. The code works around this by doing a short relative branch, saving LR (it ends up being the same value that PC is).
/********************************************************************
* GetGOT
*
* Get the GLOBAL_OFFSET_TABLE address.
* GOT = PC + (_GLOBAL_OFFSET_TABLE_ - GetGOTLabel)
*
* On exit: r12 = GOT
********************************************************************/
GetGOT:
mflr r11 // save lr
bcl 20, 31, GetGOTLabel // get PC
GetGOTLabel:
mflr r12 // into r12
// now add the fixed (at runtime) offset to PC
addis r12, r12, _GLOBAL_OFFSET_TABLE_ - GetGOTLabel@h
addi r12, r12, _GLOBAL_OFFSET_TABLE_ - GetGOTLabel@l
mtlr r11 // restore lr
blr // return
Now we get to the most annoying part. Calling the slow path. If the IMP wasn’t in the cache, we have to call a C method. This is where following that ABI really matters. This is where we finally see a difference in behaviour for that third objc_msgSend call. We need to move arguments because we’re calling a function that doesn’t return a struct.
I recall having some issues getting the stack frame offsets correct for the registers. I ended up defining all the offsets as constants, which helped to make the code easy to read as a bonus.
/********************************************************************
* SlowMethodLookup RR
*
* On entry: RR is r3 or r4 (receiver)
********************************************************************/
.macro SlowMethodLookup RR
// objc_msgSend functions like a trampoline. It's called like a C function
// but it invokes the destination function without following calling conventions
// so that the destination function returns to the parent directly.
// However, we are about to invoke a C function so we have to setup a stack
// frame and because we don't know how many arguments were passed, we
// have to save all the argument registers.
.cfi_startproc // generate unwind data
// Save lr to the lr save word in the previous stack frame
mflr r0
stw r0, 4(sp)
// The first 2 words of the stack frame are for SP and LP
// Int registers take 4 bytes each
// Float registers take 8 bytes each
// STACK_SIZE is the last slot + size of that slot, rounded up to a multiple of 16
#define SLOT_1 8
#define SLOT_2 12
#define SLOT_3 16
#define SLOT_4 20
#define SLOT_5 24
#define SLOT_6 28
#define SLOT_7 32
#define SLOT_8 36
#define SLOT_F1 40
#define SLOT_F2 48
#define SLOT_F3 56
#define SLOT_F4 64
#define SLOT_F5 72
#define SLOT_F6 80
#define SLOT_F7 88
#define SLOT_F8 96
#define SLOT_F9 104
#define SLOT_F10 112
#define SLOT_F11 120
#define SLOT_F12 128
#define SLOT_F13 136
#define STACK_SIZE 144
stwu sp, -STACK_SIZE(sp) // setup the stack frame
stw r3, SLOT_1(sp) // save function arguments onto the stack
stw r4, SLOT_2(sp)
stw r5, SLOT_3(sp)
stw r6, SLOT_4(sp)
stw r7, SLOT_5(sp)
stw r8, SLOT_6(sp)
stw r9, SLOT_7(sp)
stw r10, SLOT_8(sp)
stfd f1, SLOT_F1(sp)
stfd f2, SLOT_F2(sp)
stfd f3, SLOT_F3(sp)
stfd f4, SLOT_F4(sp)
stfd f5, SLOT_F5(sp)
stfd f6, SLOT_F6(sp)
stfd f7, SLOT_F7(sp)
stfd f8, SLOT_F8(sp)
stfd f9, SLOT_F9(sp)
stfd f10, SLOT_F10(sp)
stfd f11, SLOT_F11(sp)
stfd f12, SLOT_F12(sp)
stfd f13, SLOT_F13(sp)
// We are going to invoke IMP slowMsgLookup(id *receiver, SEL cmd)
// Pass the address of the receiver arg (on the stack)
.if \RR == r3
addi r3, sp, SLOT_1
.else
addi r3, sp, SLOT_2 // receiver was in r4
mr r4, r5 // copy the selector to r4
.endif
// If we objc_msgSend an unknown object, we'll calls things like
// +initialize and +load from within this stack frame and if they
// throw an exception, libunwind needs to know how to unwind this
// frame.
.cfi_adjust_cfa_offset STACK_SIZE // size of the frame
.cfi_offset lr, 4 // how to restore lr
bl CDECL(slowMsgLookup) // returns the IMP in r3
mtctr r3 // copy IMP to ctr
lwz r3, SLOT_1(sp) // restore function arguments from the stack
lwz r4, SLOT_2(sp)
lwz r5, SLOT_3(sp)
lwz r6, SLOT_4(sp)
lwz r7, SLOT_5(sp)
lwz r8, SLOT_6(sp)
lwz r9, SLOT_7(sp)
lwz r10, SLOT_8(sp)
lfd f1, SLOT_F1(sp)
lfd f2, SLOT_F2(sp)
lfd f3, SLOT_F3(sp)
lfd f4, SLOT_F4(sp)
lfd f5, SLOT_F5(sp)
lfd f6, SLOT_F6(sp)
lfd f7, SLOT_F7(sp)
lfd f8, SLOT_F8(sp)
lfd f9, SLOT_F9(sp)
lfd f10, SLOT_F10(sp)
lfd f11, SLOT_F11(sp)
lfd f12, SLOT_F12(sp)
lfd f13, SLOT_F13(sp)
lwz sp, 0(sp) // pop the stack frame
lwz r0, 4(sp) // restore lr
mtlr r0
.cfi_endproc // generate unwind data
bctr // goto *imp
.endm
As with the fast path, we finish by loading the IMP into the count register so that we can call it like a trampoline.
One of the features of Objective-C 2.0 is support for blocks (lambdas for C). There’s a helper to turn a block into an IMP (since so many existing APIs work with IMPs rather than blocks). Note the 2 variants again (regular vs structure return).
// block_to_imp sets up the memory like this: [ block_invoker ] [ blockptr ] [ trampoline ]
// Called as trampoline(receiver, selector, ...)
// Change this to block_invoker(blockptr, receiver, ...)
// RR = r3 or r4
// SR = r4 or r5
.macro trampoline RR, SR
mflr r0 // save lr
bcl 20, 31, $+4 // get PC
mflr r11 // into r11
mtlr r0 // restore lr
mr \SR, \RR // move receiver to SR
lwz \RR, -12(r11) // RR = block pointer (3 words before r11)
lwz r0, -16(r11) // load block_invoker (4 words before r11)
mtctr r0 // into the counter register
bctr // goto *block_invoker
.endm
CDECL(__objc_block_trampoline):
trampoline r3, r4
CDECL(__objc_block_trampoline_end):
CDECL(__objc_block_trampoline_sret):
trampoline r4, r5 // r3 holds the structure address
CDECL(__objc_block_trampoline_end_sret):
Hmm… This has an alternate implementation of that “get PC” logic that uses a relative branch instead of a label. Same numer of instructions though.
In theory, your system should have a working __clear_cache, but that wasn’t the case for me. I had to implement it so that block_to_imp could actually work.
// block_to_imp.c sets up the trampoline in heap-allocated memory but for this
// to work, __clear_cache needs to instruct the CPU to clear the memory from
// the instruction cache. But libgcc's __clear_cache doesn't do anything!
// This is an implementation of __clear_cache that actually works.
// r3 = start of trampoline
// r4 = end of trampoline
.globl CDECL(__clear_cache)
TYPE_DIRECTIVE(CDECL(__clear_cache), @function)
CDECL(__clear_cache):
1: // LOOP:
dcbst r0, r3 // flush data cache line
icbi r0, r3 // flush instruction cache line
addi r3, r3, 8 // r3 += CACHE_LINE_SIZE
cmpld r3, r4 // if (r3 <= r4)
ble 1b // goto LOOP
sync // ensure dcbst completes
isync // ensure icbi completes
blr // return
The final piece in the puzzle was to get Clang to actually call objc_msgSend for PowerPC. By default it was inserting code to call the slow path all the time (ew). Adding the flag -fno-objc-legacy-dispatch got it to call objc_msgSend.
With all of that done, I was able to get the libobjc2 tests running (at least as well as they did on other architectures, not all of the tests were passing for us). Also, some actual code that used objc_msgSend and blocks was able to run.