Performance and architectural implications
Pointer auth overhead
Pointer authentication does add extra instructions (PAC and AUT) to the execution path of functions and pointer uses, but these are highly optimized. The PAC instructions use a custom block cipher (QARMA) designed to be fast in hardware, completing in a few cycles. Studies have estimated the performance cost of PAC at roughly 4-6 CPU cycles per signed-pointer operation. In practice, this translates to negligible impact on typical workloads: one academic evaluation showed that signing return addresses in every function added under 0.5% runtime overhead. Apple’s own measurements are not public, but the fact that PAC is enabled by default for system software indicates the overhead is low enough to be outweighed by the security benefits. Apple likely optimized its microarchitecture to handle PAC instructions efficiently in the pipeline. Indeed, many PAC instructions occupy the HINT encoding space and are treated as NOPs on older CPUs, and on PAC-capable CPUs they can execute in parallel with other operations when possible. The M1’s out-of-order core can thus often execute a PAC check without significantly delaying subsequent instructions. Additionally, combined instructions like RETAA/RETAB (authenticate and return), added in ARMv8.3, reduce overhead by folding two operations into one. From a code-size perspective, PAC does increase binary size slightly (extra instructions in prologues/epilogues), but the growth is modest (a few percent), and as noted, the hint-encoded instructions behave as NOPs on non-PAC hardware, maintaining backward compatibility.
Instruction pipeline and CPU design
The introduction of PAC/BTI influenced Apple’s CPU pipeline design. For example, the need to sign the return address at function entry means the CPU’s front end must handle a PACIASP instruction early in most function prologues. Apple likely ensured the branch predictor and return stack know how to handle signed return addresses; Apple’s CPUs reportedly include a pointer-authentication-aware buffer in their speculative-execution machinery to avoid mispredicting across PACed returns. The pipeline also needed minor adjustments for BTI: the CPU must check the instruction at an indirect branch target for a BTI opcode before permitting execution to continue. This check happens in the instruction fetch/decode stage for indirect branches, and while it adds work, it is a simple opcode comparison on a single 32-bit instruction and has minimal impact on modern pipeline throughput. Apple’s M1/M2 fetch/decode logic was likely designed with BTI from the start (since ARMv8.5 features were implemented), so there’s essentially no known penalty when indirect branches go to valid targets. Only invalid jumps pay a cost (and those are precisely the attacks we want to thwart).
Memory model and concurrency
The shift to the RCpc memory model (weaker ordering in specific cases) in ARMv8.3 could have implications for software, but Apple’s platforms likely saw mostly positive effects. By allowing the hardware to reorder a Store-Release and a following Load-Acquire to a different address, the CPU can optimize memory coherence traffic. This is an invisible change to well-behaved programs (it only relaxes an ordering that isn’t relied on by properly synchronized code), but it improves performance in lock-free data structures or producer-consumer scenarios by not forcing a total order unnecessarily. Apple’s macOS and iOS kernels are carefully written to ARM’s memory model; the move to RCpc did not require changes in most code but gives the Apple M1/M2 memory subsystem a bit more freedom to reduce inter-core memory stalls. The overall cache hierarchy and coherence protocol in M1/M2 are very advanced (with large private L2 caches and a system-wide LLC). ARMv8.3’s memory model tweaks help maximize utilization of these caches by loosening ordering when safe. There’s no evidence of any regressions caused by this; it’s an under-the-hood performance tuning that Apple gets “for free” by virtue of ARMv8.3 compliance.
Privilege levels and OS design
Apple’s use of ARMv8.3 did not introduce new privilege levels, but it did solidify the use of EL2 in macOS for the first time. On Intel Macs, macOS had no equivalent of EL2 (hypervisor mode). With Apple Silicon, macOS uses EL2 to run its hypervisor (for virtualization and, in some cases, security-monitor code such as pointer authentication management). Notably, Apple chose not to use ARM’s EL3 at all: the highest level (EL3, typically used for a secure monitor in TrustZone) is unused in Apple’s design. Instead, Apple uses a separate Secure Enclave Processor for sensitive tasks. This means the main CPU runs everything at EL2 (hypervisor/XNU) and below. Ignoring EL3 simplifies the exception model for macOS, but it required that features like pointer auth, which some schemes manage from EL3, be managed at EL1/EL2. Apple handled this by doing key management in the kernel and hypervisor. For instance, as mentioned, Apple added special routines to enable/disable user PAC keys around context switches at EL1, rather than relying on a secure monitor to manage keys. The architecture (ARMv8.3/8.4) gave Apple the tools needed (such as new control registers and HCR_EL2 bits) to implement this cleanly in software. The upshot is that macOS can run virtualization and security services without the performance hit of switching to EL3; everything stays in the fast path of EL1/EL2.
Cache and memory impacts
ARMv8.3’s FEAT_CCIDX (extended cache index) allowed for describing larger caches in system registers. Apple’s M1 Max/Ultra have massive caches, and using the 64-bit cache descriptors ensures macOS can fully utilize cache size info for optimizations (like determining effective caching strategies or core scheduling based on cache topology). There’s no direct runtime “speedup” from this feature, but it future-proofs Apple’s design as cache sizes grow. Similarly, larger virtual address space support (up to 52 bits) means macOS could, down the line, manage 4 PB virtual address spaces if needed: useful for very large-memory servers or expanded ARMv9 features. Currently, macOS uses 16KB pages and 48-bit VAs on Apple Silicon (giving 256TB space, which is plenty), but the hardware capability exists if needed for pro workflows or specialty use cases.
Impact on software ecosystem
From a developer’s perspective, targeting ARMv8.3-A (arm64e) in Xcode has some minor implications: it required updated compilers and updates to some low-level code. For example, apps doing unusual things with the stack or using hand-written assembly needed adjustments to cooperate with PAC. In rare cases, developers had to insert PAC instructions into their assembly, or strip the PAC before manipulating authenticated pointers, to prevent crashes. Apple provided new APIs (in <ptrauth.h>) to manually strip, sign, or blend PACs for advanced use cases. Overall, the transition was smooth and did not degrade app performance; in fact, many developers noticed no difference except improved security. Apple’s Rosetta 2 translator even handles arm64e<->x86_64 transitions seamlessly for apps, ensuring that pointer auth doesn’t break translated code (by disabling PAC when running translated x86 code and re-enabling it when returning to ARM code, as needed).
In conclusion, the performance impact of ARMv8.3-A’s features on macOS has been minimal or positive. Apple’s Silicon team balanced security and speed, leveraging architectural enhancements like RCpc and efficient PAC implementations to ensure that macOS on M1/M2 feels as fast (or faster) than on previous architectures, all while running dramatically more security checks under the hood. The integration of these features showcases Apple’s ability to adopt cutting-edge ISA improvements without compromising the legendary performance per watt of their chips.